>> Zhengyou Zhang: Good evening, ladies and gentlemen. Welcome to our lecture. This lecture is sponsored by the Technical Society, a series about what's hot, what's best, and what's going on in the industry, and we are delighted to have Dr. Li Deng with us tonight. Li is our Engineer of the Year for 2015 for Region 6. You will be amazed how much he has achieved: he has more than 60 international patents, four books, and 300 to 400 technical papers, really great papers. And he has a lot of honors and awards: the 2013 IEEE SPS Best Paper Award; Editor-in-Chief of the IEEE Transactions on Audio, Speech, and Language Processing; Editor-in-Chief of the IEEE Signal Processing Magazine; the Technology Transfer Award; the Gold Star Award in 2011; the IEEE SPS Meritorious Service Award; General Chair of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing; member of the Board of Governors of the IEEE Signal Processing Society; Vice President of Industrial Relations of the Asia-Pacific Signal and Information Processing Association; member of the Publication Board of the IEEE Signal Processing Society; keynote speaker at Interspeech on September 18, 2014; keynote speaker at the 12th China National Conference on Computational Linguistics; and so many more keynotes that I just don't want to list them anymore. He is an amazing person, and I think you will enjoy this evening.
>> Li Deng: Thank you very much for the introduction, given earlier, before you all came, yes. So, today's topic: I originally called the talk "Recent Advances of Deep Learning," but many of my colleagues said I should take the opportunity to talk not only about the recent advances but also about some of the history of how Microsoft got into this area of research. And this is a very hot area now. People talk about artificial intelligence, machine learning, deep learning; almost every single day you get news coming out of the media. That has happened over the last three or four years, maybe two or three years for the more recent wave, and you see a lot of acquisitions of small companies. Microsoft hasn't acquired any companies in this area; Google has acquired a lot of them, and going through this lecture you are going to see why. So I'm going to provide a selective overview of a whole bunch of projects done at Microsoft on deep learning, and I'm going to show you how, historically, deep learning got started here, how it advanced, and how the whole industry is now working in this area in terms of the state of the art. First of all, I want to acknowledge many of my colleagues at our Deep Learning Center at Microsoft Research, the collaborating universities, and many Microsoft engineering groups, some of whom are probably sitting here, who collaborate with us to create impact with these projects. Almost all the material I will talk about today is public; it has either been published or covered in outlets like CNN or the New York Times. This talk is a summary of a whole bunch of things we have put together, and I'm also going to point you to some of the new activities we are planning, not in much detail, because they are not public yet, but I'm going to say whatever I feel comfortable saying. Okay, so the first real topic is how, when, and why Microsoft got into deep learning, and I'll start with deep speech recognition. This is actually very early work that we started doing with academic pioneers.
So, for example -- there's no laser pointer here. Okay, but that's fine; I can just talk through this. Officially, Microsoft got into deep learning research approximately toward the end of 2009, when we worked with Professor Geoff Hinton at the University of Toronto. Actually, the collaboration started earlier than that: we worked together right before this, and he traveled to this building, actually to my office, and we worked together on what's called deep learning for speech recognition. There's a long story about how that led to this workshop. During the workshop there was extremely high interest in doing deep learning for speech recognition, and we consulted with the university academics who came over here, and we learned a great deal from them. Then he sent a couple of students to work with us, we devoted a fair amount of resources to the collaboration, and we actually made the whole thing work well. So now it's fair to say that every single voice interaction system in the world, including Microsoft's, Google's, and Amazon's, is using this technology, and I'm going to tell you how the whole thing started as part of this presentation. We started somewhere around 2009, we worked quite a number of years in collaboration with academics, and most of the work was done right in this building. We created this deep learning speech recognition technology, and it was first demoed by my boss, Zhengyou's boss, our boss, Rick Rashid. He has now retired, soon after he gave this talk. He was actually in China giving the talk, and the story was featured in the New York Times in 2012; John Markoff was the reporter, and he interviewed me right in this building, as well as Geoff Hinton. The story is really amazing. This was a real-time demo system in which Rick Rashid spoke English (he doesn't speak Chinese), his speech was automatically recognized in English, translated into Chinese, and then spoken aloud by a Chinese speech synthesizer that sounds like his own voice. Many of my colleagues over there saw it, and the students -- this was a big audience of about 3,000 students -- some of them shed tears. They said, wow, now they can talk with their mother-in-law easily, and this is a joke, right? A lot of Western people marry Japanese spouses and couldn't speak with their in-laws, and now they would be able to speak without any difficulty. That happened in 2012. Now, the underlying technology used in this demo is what's called deep neural networks; we call it the context-dependent deep neural network hidden Markov model. It's all technical, and it was invented about two years before that demo through our collaboration with the University of Toronto. And this is a very important figure here: this axis is the year. If you sit in the back you won't be able to see it; this end is 1993 and this end is 2012, so that's the span of time, and the other axis is the speech recognition word error rate. In 1993, when people started evaluating the word error rate for spontaneous speech recognition, just like the speech I'm producing now, the error was almost 100%: everything was wrong. That's about 20 years ago. Then people worked very hard, and every single year the error dropped. Look at how much. This kind of evaluation was sponsored by DARPA, which put tons of money into individual groups, including companies; IBM is one of them.
Many universities participated, and whoever got the best result got funding the next year. It's a very stressful test -- if they're not doing well, they get cut off -- and that's why the error dropped so much. Every single year it came down. Then, roughly around the year 2000, things stopped changing: the error stayed about the same, at twenty-some percent, for about 10 years. In 2009 and 2010, when we first started doing deep learning in collaboration with Toronto, the error rate dropped from twenty-some percent to fifteen-some percent. And then for Rick Rashid's demo, two years after that, the error dropped to about 7%, and most of the things he said got recognized correctly. Look at how much it moved: for 10 years it barely changed, and then it just dropped. Essentially, the error fell by more than half, so this became usable. As for the underlying technology, I don't have time to go through all the technical details of how deep learning works here, except to let you know that deep learning essentially extracts the speech information and puts it into a layer-by-layer representation in the neural network. In the end we used up to about 12 or 13 layers.
>>: Can you give a basic definition of DNN? Is deep learning the same as deep neural networks?
>> Li Deng: For this talk, you can roughly think about it that way. In speech recognition, maybe 90% of the work is done by deep neural networks, and then there are additional newer advances in deep learning, like recurrent neural networks, memory networks, deep generative models, and also convolutional neural nets, which are a little different from the plain DNN. So think of the DNN as maybe 60% or 70% of deep learning; the other models I don't have time to go through. For image recognition I'll say a few words about a somewhat different type of network, but they follow a similar structure: the common feature is that you have a layer-by-layer representation, carrying information from the bottom all the way up to the high level, which allows you to do very effective discrimination and pattern recognition. Now, for ImageNet, which I'll go through later on, the number of layers goes to about 60. It's amazing; people just keep getting better and better results. In speech, roughly 10 to 15 layers are good enough, but if you have only a single layer, you hardly get any good result. So those were the main advances achieved by Microsoft Research around that time. And that led to this: in 2013, MIT Technology Review, a very reputable technology journal, put deep learning at the top of its list of 10 breakthrough technologies, and one of the reasons they did so is precisely that they cited the real-time speech translation from our boss Rick Rashid's demo. So that was very influential -- that's 2012 and 2013. Now, I want to tell you a little bit about how this invention spread out into the entire industry. Other companies are doing it now; some were pretty late and hired a few people from our group. And Google -- Professor Geoff Hinton is someone I have to credit a great deal; he's really the pioneer on the university side. Google acquired his company, which was really just himself and two students. That happened in about 2013, and we tried to do that as well; we didn't bid high enough. But anyway, there are a lot of interesting stories.
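Going back to the earlier question of what a DNN is, here is a minimal sketch of a plain feedforward deep network of that kind. The layer sizes, the ReLU nonlinearity, and the random input are illustrative assumptions only, not the configuration of the actual speech systems described in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative sizes only: a 440-dim input (e.g., a window of acoustic frames),
# five 2048-unit hidden layers, and 9000 output classes (e.g., tied HMM states).
layer_sizes = [440, 2048, 2048, 2048, 2048, 2048, 9000]
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    """Layer-by-layer representation: each hidden layer re-represents its input."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                        # one more level of representation
    return softmax(h @ weights[-1] + biases[-1])   # class posteriors at the top

x = rng.normal(size=(1, 440))                      # one fake input window
posteriors = forward(x)
print(posteriors.shape, posteriors.sum())          # (1, 9000), probabilities sum to 1
```

With a single hidden layer the same code runs, but, as described above, it is the stack of learned layers, trained with back-propagation on labeled speech, that makes the discrimination effective.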
Geoff Hinton is the one who actually got together all the authors, including IBM people, Google people, his graduate students, myself, and my colleagues, to write up this overview paper. In just about two years this paper has received about 600 citations; it's a very highly cited paper, and it essentially still describes the state of the art. He managed to get the shared views of four research groups, including Google, into it. This is probably the only time competitors got together without a patent fight and without any argument: we all agreed on every single word we wrote. It's about 20 pages, and he and I spent a lot of time coordinating all of it, so everybody is happy with the conclusions. So this became a very interesting kind of original paper that many people follow up on. And of course, once you demonstrate this kind of large-scale application successfully, everybody follows, and that includes Google now. There's our Skype Translator, Windows Phone, Xbox; all the voice features there use the same technology I showed you. Siri is using it too -- my former boss was hired by Apple about two years ago and started doing all of this -- and then there's Baidu. Basically, every single major company is using the same technology, so this is very, very impactful. Also, in university labs and DARPA programs, every single team now uses this technology, and I'm very proud to say that it originally started in this building, right on the third floor. Okay, so this is an article from Businessweek; everybody knows about it: "The Race to Buy the Human Brains Behind Deep Learning Machines," from about one year ago. Our new boss, Peter Lee, said some very interesting things in it. This is a fabulous article that I use to recruit people. According to him -- and I actually checked with him whether he really said this; he told me "kind of" -- companies including Microsoft and Google find themselves in a battle for the talent; Google actually spent about $400 million to purchase a company about one year ago, and we didn't do that. Microsoft went from about four full-time employees working together on this at the time I described to about 70 one year ago, and now it's more, maybe 100-some people in the company; maybe some of you are here. "And we would have more if the talent were there to be had." And he said that last year the cost of top deep learning experts was about the same as top NFL quarterback prospects. Do you know how much they make? I did ask; I didn't get an answer. Well, that's what our boss said, so I'm not sure, but it's a very good recruiting statement, and he put it in Businessweek, so I can quote it. Okay, so now, this is not just speech recognition. Speech was the beginning; now we are moving from recognition to translation, to understanding, so I'm going to spend three minutes talking about each of these projects at a high level. For the speech translation project, Wired covered the story and CNN covered the story, all within one year, about how Skype uses artificial intelligence and deep learning to build the new language -- I do have a video, so if I have time, I'm going to show you some of it and how this technology works.
Now, we also have the speech understanding task, because recognition is not the whole story. For Siri to be able to talk to you, for Cortana, for Google Now, they all use a technology called speech understanding, and this is one of the very popular models for it. It's a little bit different from a plain deep neural network: it has a recurrent characteristic, so time effectively becomes the depth in this type of architecture, and it's very useful for any time-series problem, like speech translation and understanding, and also for things like general time-series analysis. I have read some papers where people use this model to do stock market prediction; I haven't seen the results, and even if they use it successfully, I don't think they're going to publish anything, because that is competitive. Recognition is not competitive in that way; everybody can get a [indiscernible]. So these models have to be recurrent, and we wrote some papers on this. The next topic I'm going to do very quickly; there are so many projects, so I'm going to selectively go through some of them. The next one is a big subject called deep image recognition at Microsoft, and Microsoft was a latecomer to this area, so I have some very interesting stories about how this whole thing started. In the fall of 2012, the same thing happened for image recognition. It's very similar to speech, except the history is a bit shorter, and it wasn't the government that sponsored the competition; Stanford University created the database used to evaluate the different competing teams. Before 2012, people used this test called ImageNet: about 1,000 categories of images, with about 1 million training samples, and every team trains its system on them. Before 2012, the systems were all shallow: essentially, people did handcrafted feature engineering and then used an SVM or a sparse classifier, a single classifier on top of the handcrafted features, and everybody did that, and the typical error rates were about 26%. That had been stable for the last several years, ever since the competition was created, maybe five years earlier. In the first two or three years the error kept dropping, up until that point; I think in the previous year, 2011, the error rate was also about 27%. So rather than 10 years of a stable error rate as in speech, they had about 26% or 27% error for two or three years. And in that same year, 2012, Toronto's group -- the same professor, Geoff Hinton, and his team -- already knew their approach worked for speech; this was two years after the speech work. If you have the belief that the technology is good, you put your best people on it, and he applied the same kind of technology using convolutional networks, a bit different from the plain deep net but a similar concept with many layers; at that time they used about seven layers. And all of a sudden they dropped the error from 26% to 15%, just like in speech recognition, and that created a huge amount of excitement. What I want to say here is that during the fall of 2012, around October 20th or so, the announcement of the results was made public. Typically the way these competitions work is that people submit their results and Stanford University evaluates them.
You have no idea how well you are doing, because the test labels are withheld. So when the result was actually announced, it was so good that many people didn't believe it. I think on the day the result was announced, I was actually here in this very room, giving a lecture on machine learning. Chris Bishop, one of the very famous scientists in our company, was also there; we were running a company-wide machine learning lecture series, and I was one of the lecturers. Right before I gave the lecture, I got an email on my phone: Geoff Hinton had sent me an email saying, Li, take a look, look at how much of a margin there is over the previous results. I wasn't sure how genuine it was, so I forwarded it to the people in the company doing [indiscernible], and they came back saying there must be something wrong with it, it's too good to be true. Most people didn't quite believe it. In that lecture, which was recorded if you want to watch it, I put that number up and said, I don't know whether we can believe it or not, but this is just a wonderful number. It turned out that the following year all the computer vision groups not only duplicated it but improved on it, including our own Microsoft Research group, so we were actually latecomers in a sense. But anyway, in that single year, this made a huge impact on image recognition, and I think the University of Toronto took all the credit for it. So the shallow models were somewhere around here, and in just the first year of deep learning the error dropped to about 15%. The second year, the participants included New York University and also a couple of companies, including Microsoft, and the best result was about 11%; the error dropped a little bit more. Then this year -- 2014; the results were announced three or four months ago, last October -- the error dropped to 6.6%, by Google. They put their best people on it and built a network of about twenty-some layers with parameters; counting the layers without parameters, it's about 60 layers, and many of those have no parameters to change, so somewhere between 20 and 60 layers. It just keeps getting bigger and deeper, and the error is down to about 6.6%. And then Microsoft has a very nice group in Beijing, Microsoft Research Asia. Around February 5th of this year, about three months after the competition results, they published a paper dropping the error to 4.8% or 4.9% -- this paper right here, by our colleagues at Microsoft Research in Asia. It's just wonderful; this is their number down here, and this is the number from Google's competition entry. Yes?
>>: You said 1K?
>> Li Deng: 1,000.
>>: That's the number of nodes?
>> Li Deng: No, that's the number of classes at the output: how many categories you can assign to an image. Okay, so about one day after this number was publicized, there was huge media coverage, and the headline said that now Microsoft beats Google. And then the day after that, Google published a paper in the same way -- no peer review, just posted -- and the media coverage came back, because Google also immediately published a paper, not completely written.
I think they looked at this and then reported 4.8%, and the media headline was that now Google regains the lead. All of this happened within about a month, a month and a half ago. But anyway, this is very, very interesting. That's the end of the image part. Yes?
>>: What's the method for classification? In other words, you're saying there's --
>> Li Deng: Oh, this is very simple. You build a layer-by-layer representation.
>>: How are they building those layers? In other words, how are they building the classifiers? In an automated way, or is it still --
>> Li Deng: The learning is automated. The whole point of deep learning is that you have an algorithm called back-propagation. It gives you end-to-end learning: you provide the target at the top layer, and then the learning propagates all the way down to the bottom, to the raw features.
>>: After it is learned, or it --
>> Li Deng: No, the structure is not learned; correct. The structure has to be built in. For images, if you just directly use a plain deep neural network, it's not going to work well; you have to use a convolutional structure, and that structure is well known. Okay, so next -- my time is almost up, and this was actually just my introduction. The real work that came after all of this is what's called semantic modeling, and that's getting much, much more interesting now. I think Microsoft is really taking the lead in this area, although our company doesn't get a lot of publicity outside, because a lot of the work has business value to Microsoft. We do publish some of the work, so I'm not going to talk in detail about all the business implications; I'll just tell you what kind of technology we have developed, and you can imagine what kind of business applications it may have. In order to do this kind of modeling, we established a technology center, where I manage researchers and developers working on a whole bunch of applications you're going to see over the next 20 minutes. The model we built is called the Deep Structured Semantic Model, DSSM; think of it as a deep semantic model. We have a whole bunch of publications on it, and I'm going to speak strictly from those publications; at the end I have a list of them if you want to know more. As for applications, we don't talk about them too much, but I'm going to give a little bit of a hint about what kind of applications we have been building. I'm not going to go through the detailed technical material, except to say this: when we talk about a semantic model, it means that when a sentence comes in, or some business activity comes in, what is the underlying semantics? What does it mean? If the computer knows what it means, you can do a lot of things: email analysis, web search, advertising. All of these can be done if the computer understands what the information means, and that structured meaning enabled by this model is what we call the semantic model. Now, there are a few technical points here. The main one is that we train this model using various signals; I'm going to show you what kind of signals they are, and they come from different applications. But for most of them, we acquire the signal without human labeling effort; otherwise it's too expensive.
So for example, one type of signal that doesn't require human labeling is the click signal for Bing. We have Bing in our company, so every time you search on Bing -- or on Google; if you search on Google, Google gets your signal and uses it for whatever they do -- Microsoft has that signal, and we use it. You actually donate that signal to Microsoft or to Google; you may not think of it as giving anything away, but you donate it to them. And that signal tells you that this query and this result are related, which is an extremely valuable signal, and from the company's point of view you don't have to pay any money to get it, so we take advantage of it. I'm going to show you some examples. Rather than giving you the full mathematical description of this model, I'm going to show you an animation of exactly how it works, and then a whole bunch of applications of it. Suppose you have a phrase, "racing car," and you need to understand what it means. It's a car, and it's used for racing, but when you put the words together, how does the computer know what you mean? The way we do it is that we build a deep neural network, many layers, a matrix and a nonlinearity at each layer, and at the end you get a 300-dimensional vector. In the beginning you don't know anything about that vector: you randomize all the parameters, and you code the input in terms of symbols. Coding information as symbols is very simple; you just look at the dictionary. But the semantic meaning doesn't get captured when you do this. If you don't train all these parameters and just multiply through, there is no meaning extracted; they're just random numbers. So when you initialize the model, you get these 300-dimensional vectors, and they're arbitrary. "Racing car" gives you this vector. Now take something else, "Formula 1." Are they related to each other? Yes, they're related, although the words have nothing in common, so if you use symbols, with nothing in common, you don't learn anything. If you do deep learning, you can come up with the connection, and I'll show you how. But at model initialization, nothing has happened yet, right? It's random, so this vector can be here or it can be there; the vectors don't have to be close, even if the meanings are the same. Now take another phrase, "racing towards me," for example. It's random as well, and it shares the word "racing." Are the two related to each other? Not really, right? Even though a word is the same, they don't mean the same thing at all. And these two, "racing car" and "Formula 1," are very similar in concept but have no words in common. So if you use symbols to code the information, as most natural language processing people do, you really don't know much about the connections. That's the initialization; nothing is learned yet. The next step is the training, the learning process, and that is really the secret sauce of how everything works. In the learning process, we compute what's called the cosine similarity between semantic vectors. When we initialize, this phrase maps here and that one maps there; they are random, not close to each other even if the meanings are similar, and these two are random as well.
Now, when we learn the weights in the deep network -- okay, sorry -- when we learn the semantic vectors, we actually know that when a user types "racing car," the Formula 1 website may show up and get clicked. We know that, and if we know that, we know the two are related. Therefore, during the learning process, we force the cosine distance between this vector and that vector to be small; we make the distance as small as possible, and that's the objective function. And since this other document almost never gets clicked when the user types "racing car," we want to push those two as far apart as possible. So we want these to be close to each other and those to be far apart, and we developed a formula for that. It's just a function saying that for the clicked, positive pair we want the numerator to be big, which means its cosine similarity should be large, and for everything else, the unclicked documents, we put them in the denominator and want their contribution to be small. The numerator holds the pair with the click information, the pair the computer knows should be close, and the non-clicked pairs go into the denominator. When you take the ratio of the two and optimize it, you force the clicked pair to be close and the others to be far apart. And then you do back-propagation; I don't have time to go through the detailed algorithm, for those of you who know it. Now, after everything is trained, you do the same thing again: you run "racing car" through, and once the weights are learned -- Formula 1, wow, you see what it's saying -- because at the end of training we forced them together, the vectors come out similar, and therefore you can rank with them. "Racing towards me," which we pushed apart during training, stays far away. So when you have many other documents, you can use the cosine distance to rank documents for a specific query, and when you type something into the search box, the right page will show up even if it has no words in common with your query. It's a very, very intuitive idea, and we use it as the basis for many applications.
>>: Are you actively forcing those apart, or just bringing those together --
>> Li Deng: Both. If you optimize that ratio of cosines, you do both. Yes, it does both. Okay, I'm not going to go further into it, except to say that all of this is done on GPUs.
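A minimal sketch of the scoring model and the click-based objective just described. Everything here is an illustrative assumption: a toy hashed bag-of-words input stands in for DSSM's real word-hashing features, the layer sizes and the scaling constant gamma are made up, and separate query and document towers are used purely for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIMS = 5000, [5000, 512, 300]          # toy vocabulary and layer sizes (illustrative)

def init_net():
    return [rng.normal(0, 0.05, (m, n)) for m, n in zip(DIMS[:-1], DIMS[1:])]

query_net, doc_net = init_net(), init_net()   # separate towers for queries and documents

def text_to_bow(text):
    """Hashed bag of words, a stand-in for DSSM's letter-trigram word hashing."""
    v = np.zeros(VOCAB)
    for w in text.lower().split():
        v[hash(w) % VOCAB] += 1.0
    return v

def embed(bow, net):
    """Map a sparse term vector layer by layer to a 300-d semantic vector."""
    h = bow
    for W in net:
        h = np.tanh(h @ W)
    return h

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def click_loss(query, clicked_doc, unclicked_docs, gamma=10.0):
    """Negative log of a softmax over cosine scores: the clicked document forms the
    numerator, clicked plus unclicked documents form the denominator."""
    q = embed(text_to_bow(query), query_net)
    scores = np.array([gamma * cosine(q, embed(text_to_bow(d), doc_net))
                       for d in [clicked_doc] + list(unclicked_docs)])
    m = scores.max()
    return -(scores[0] - m - np.log(np.exp(scores - m).sum()))

loss = click_loss("racing car",
                  "formula 1 grand prix results",
                  ["gardening tips", "racing towards me lyrics"])
print(loss)   # training would adjust both towers by back-propagation to reduce this
```

Training over many (query, clicked document, unclicked documents) triples and back-propagating through click_loss is the numerator-versus-denominator trade-off described above: the clicked pair's cosine is pushed up and the unclicked ones are pushed down.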
So that was just the example for web search, and there are so many other things that can be done this way: question answering, automatic app recommendation, natural user interfaces, a whole bunch of things. Our center explores many, many of these applications. I'm not going to go through all the details, except to say that we published a lot of them, usually not with big fanfare; we publish in smaller venues so people don't notice. But now I've said it, so people can obviously -- anyway, I'm going to show you two examples, given the limited amount of time. One is automatic image captioning, which is a very interesting task. It's not just recognizing the image: image captioning means that if you have an image, you want the system to write down a sentence about it, the way a person would. For this image, people will write down "a stop sign at the intersection on a city street." So it's not a recognition task anymore; it's caption generation. It tells the story of an image, or a video, or a movie, or whatever.
>>: Self-driving cars will need these.
>> Li Deng: Yes, eventually they might. I'm not going to go through everything, but the model I showed you earlier was actually used in this system. We published this paper at CVPR; we had about 10 people working together on it over the summer, with four interns, and we worked hard on it. Google had a paper on this at the same time, and Stanford had a paper too. The paper submissions were about two months ago, and the minute they submitted, they called the New York Times, and the New York Times ran a big story about their work that weekend. We saw that and thought, we should have called them as well. So we called them too, on the Monday, and they said, that's old news already. But anyway, all of these got accepted -- six papers, because six groups were doing this simultaneously: Stanford has a paper, Toronto has a paper, Microsoft has a paper. It's a big race. Typically, in academia, people submit to CVPR and IEEE venues and wait for the conference. Not anymore: they publish immediately in open archives and call the media, and the media covers them. It's a very strange world now, and it's all because of deep learning; without deep learning, people wouldn't do that. Anyway, I'll show you some results; these are the most interesting ones. Here's a photo, and you ask the machine to describe it, and the machine says "a group of motorcycles parked next to a motorcycle." It's okay, not too bad. And here is the human annotation: "two girls are wearing short skirts and one of them sits on the motorcycle." Do you know why the difference? People say, well, machines don't care about girls. For other examples we get some very interesting results. Of course, if you compare this caption to that one, which do you prefer? Maybe you prefer this one; some people may prefer that one; some people say they're equally good. So we actually ran a crowdsourcing study to ask people -- I don't have to go through it here -- and about 25% of the people on Mechanical Turk said the machine caption is as good as or better than the human annotation. For the example I gave you, maybe they're equally good; well, maybe not. The majority still prefer the human captions, so we're getting there. Next year we expect this to be maybe 35%, the following year maybe 50%; you see what I'm saying, this technology is improving very quickly.
>>: Can you distinguish the human? Like if you give them --
>> Li Deng: Yes: if I don't tell you which caption is machine and which is human, are you able to tell? Which one do you prefer? If you can tell, you belong to that group; that's not bad. But how about the other one? For this picture, which would you say: "a woman in a kitchen preparing food," or "a woman working on a counter near a kitchen sink preparing food"? They could be similar. Anyway, we chose good examples here, and there are worse ones; that's why it's only 23%. So you get the gist of what's going on here.
Now I'm going to give the next example, and this one is a real example in Microsoft Office. We released it, and there's an article written about it, so I'd like to tell you what it is. It's called contextual entity search. I'm going to show you a very good example that I'd like you to go home and try, so I'm promoting Microsoft products using deep learning. The article Microsoft published is called "Bing Brings the World's Knowledge Straight to You with Insights for Office," and the underlying technology, the machine-learned relevance model, is powered by deep learning that our group contributed to. So I'll show you how it works. Suppose you open Microsoft Word and your kid is writing an essay: "Lincoln was the 16th President of the United States and he was born in" -- I forget. Normally, do you know what he does? Typically, you go to Wikipedia and type it in. You might even just type "Lincoln," and a movie may show up, and Abraham Lincoln, and then you find the information and copy it back. Now, with Office Insights, a new feature for Office, you don't have to do any of that. You just right-click on "Lincoln," and this panel shows up right inside Microsoft Word, and you can copy the answer back from there. But how does it know which Lincoln? The movie Lincoln could show up, the car company called Lincoln could show up, the town in Nebraska could show up; if you just type "Lincoln" into Wikipedia, all three may be there and you have to choose one. Here you don't, because our deep learning takes the context into account. When you click there, Microsoft Word automatically sends the surrounding text; the model sees that the document talks about America, the United States, and so it has to be that Lincoln, not the other Lincolns, and that provides a lot of [indiscernible]. That is going to make using Microsoft Office more productive, and that's the kind of productivity we want. It's one single example of how this deep semantic model is working. And the reason why --
>>: It can show up as Abe Lincoln. It can be anything, right? I need some context to it?
>> Li Deng: So the context is here; it's automatic, yes, exactly. The whole point of this work is that the semantic model embeds the context X and the candidate entity Y and conditions on the context when we train that cosine distance. Therefore, if the model sees context like this, the right entity will be close to it, and this Lincoln will show up, because in training, similar kinds of things were put together. Yes?
>>: Question. Is this just implemented in Office 365, so it's got to --
>> Li Deng: No, it's only in Office; it's only Word Online for now, not everything.
>>: What I mean is, it's online, meaning Office 365. That's where you've got the cloud behind it. It's not running on your PC.
>> Li Deng: Correct, that's correct. It has to be on the cloud; otherwise you can't do all the deep learning computation. That's a good point. But Office 365 hasn't included it yet.
>>: I expect they will, though. And now I know what Insights will mean when I see it.
>> Li Deng: Yes, exactly, so keep that in mind. It was announced in just a small version of Microsoft Word so far.
>>: Speed to run it -- is it super-fast?
>> Li Deng: It's super-fast. You don't notice any difference.
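A small, purely hypothetical sketch of the ranking-by-context pattern just described. The embed function is an untrained stand-in (hashed bag of words plus a fixed random projection) and the candidate descriptions are made up; this shows only the shape of the computation, not the actual Insights for Office pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, SEM = 2000, 300
PROJ = rng.normal(0, 0.05, (VOCAB, SEM))   # stand-in for a trained deep semantic model

def embed(text):
    """Hashed bag of words pushed through one fixed projection, then normalized."""
    v = np.zeros(VOCAB)
    for w in text.lower().split():
        v[hash(w) % VOCAB] += 1.0
    h = np.tanh(v @ PROJ)
    return h / (np.linalg.norm(h) + 1e-8)

context = "Lincoln was the 16th President of the United States"
candidates = {
    "Abraham Lincoln (president)": "16th president of the united states civil war",
    "Lincoln (2012 film)": "movie film directed by steven spielberg",
    "Lincoln Motor Company": "luxury car brand vehicles",
    "Lincoln, Nebraska": "city capital of nebraska",
}

ctx = embed(context)
ranked = sorted(candidates, key=lambda name: -float(ctx @ embed(candidates[name])))
print(ranked[0])   # with a trained model, the surrounding words should pick the president
```

The design point is the same as in the web search example: the surrounding text and each candidate entity are mapped into the same semantic space, and cosine similarity does the disambiguation.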
So now I'll show you a negative example for deep learning. For this crowd -- when I talk with experts, I show a negative example, so people know I'm not talking only about the flowers; I'm also talking about something negative. One negative example is malware classification. For this task, when we use deeper neural networks, we don't really gain very much, and we understand why. Most people doing deep learning treat it as a black box, but once you have gone through enough examples, you know which tasks are good for which kind of model. The reason this task is very difficult to improve using more depth is really that the data we have is adversarial in nature. We published a paper on this negative example about two years ago; the result is still good, it's just that a single-layer neural net is as good as multiple layers. In the past, people had a very hard time understanding why, because the more layers you have, the better representation you should be able to get. For speech, for language, for images, that's all true; for some data it may not be. So just be aware that not everything is good for deep learning, and if you want to know why, you really need good scientific training and experience to understand it. This same task that we failed to improve with deep learning has now become a standard task in a Kaggle competition. Does anyone know the Kaggle competitions? This year this task was actually put up there -- has anybody heard about this competition? Can you say a few words? How many groups participated in it?
>>: I'm not doing this one. I did the sea plankton one, which came up just before this, so I haven't even started this one.
>> Li Deng: So the result that we published two years ago may be verified or may be disproved by this competition, and how many days does it have to go? 47 days. Usually it's the same routine: typically about 100 or 200 groups submit, and the winner gets about $12,000 or something, but you have to release all your source. Kaggle is the company that actually collects all the data and runs these competitions, and its president from the previous years quit last year to become the head of a startup company doing deep learning. Under his leadership Kaggle ran so many competitions, and he always told me that in most of the tasks deep learning won, so he wanted to do it himself rather than just letting other people do it, and now he has founded that company. So we are going to see the result of this competition, but I would expect that deep learning will probably not be the top entry, if the outcome is consistent with our result. So, the summary of this talk; I'm going to finish very quickly. Microsoft actually started in deep learning around 2009. We invented this type of DNN model, which revolutionized the entire speech recognition industry, replacing the 20-year-old systems with the new deep learning systems now universally adopted by both industry and academia, and everyone knows it. Microsoft created the first successful example of deep learning at industry scale. Now, deep learning is very effective not only for speech recognition but also for speech translation.
Microsoft did a tremendous amount of work on this, and also on image recognition, which is now in image tagging for OneDrive. We were late to start this, as I showed you earlier, but we caught up pretty quickly. Image captioning, the one I showed you, hasn't gone into any product yet; someday you are going to see it coming, maybe soon, though I don't know whether Microsoft will be the first to ship it. Maybe not. And for language understanding, I showed you a little bit of this semantic modeling; I didn't have time to talk about all of it, and some of it I probably don't feel comfortable talking about: user modeling, a whole bunch of things. The enabling factors for all this success are that we have big data sets for training deep models, and we also have very powerful general-purpose GPU computing, which is very important; otherwise we couldn't even run all these experiments. And the final important factor, which many people in this community ignore, is that there is a huge amount of innovation in deep learning architectures and algorithms. In the beginning, we got comments asking, is the neural network the only architecture around? If it is, then there's not much innovation; it's fixed. But to do these kinds of things, semantic modeling and so on, the architectures are so varied. A lot of the innovation goes into how to design the architecture, based on deep learning principles and your understanding of the domain problem, in order to make it effective. So there's a huge innovation opportunity here, and the innovation from the Microsoft Research standpoint is how to discover distant supervision signals that are free from human labeling. If you require human labeling, it's too expensive; it's just not going to scale up. In the web search example I showed you, there was no human labeling, right? It was automatic: the users gave the signal to us, and we used it; if they didn't give it to us, it wasn't going to succeed. And after we understand how to get such a supervision signal, we need to know how to build deep learning systems grounded on exploiting these smart signals; the example is the DSSM I showed you earlier. I think that's pretty much it -- now, another summary. This summary is actually written for experts on deep learning. For speech recognition, I'm very confident in saying that all the low-hanging fruit has been taken. No more. You have to work very hard to get any gain, and for that reason my colleague and I wrote a book, about 400 pages, summarizing why this is the case. People work very hard at it, so it's good to get that out and help others make faster progress. Every time you go to a conference, you see a lot of beautiful pieces of work that don't get much gain: a tiny bit of gain for much more work, not like three or four years ago, when you could take one GPU, dump in a load of data, do a little innovation in the training, and get a huge amount of gain. Now the gains have become smaller. This is not to discourage people from working on it; it's actually to encourage serious researchers and graduate students who want to write PhD theses to get into this area. But the low-hanging fruit is gone, and if you're a startup company, don't do it, unless you find some good application to use the technology, rather than selling the technology by itself.
Now, for image recognition, I would say most of the low-hanging fruit has been taken. The error rate just dropped to about 4.7%, which is almost the same as human capability. If you get more data and more computing power, you can still push the error down a little bit more, but not much more. For natural language, I don't think there is much low-hanging fruit; everybody has to work very hard to get gains, including all the deep learning researchers. For business data analytics, there is a new research frontier with some low-hanging fruit; I'm not going to say too much about that. Also, for small data, many people think that deep learning may not work well, and that's wrong. People always say: for small data, use traditional machine learning; deep learning is for big data. That's wrong, and there are many examples on Kaggle; we're going to hire somebody who actually won one of those competitions to join our group. So deep learning may still win on small data; it's not guaranteed. For big data -- for perceptual data especially -- deep learning always wins, and wins big.
>>: Perceptual meaning?
>> Li Deng: Speech, images, human sensory data, and maybe gesture, maybe touch, that kind of thing. For those, I think deep learning will win, and I have some theories as to why. Partially it's because if you look at existing human perceptual systems, they do this well, and the neural network simulates many of their properties, so there's no reason it shouldn't do well. But be careful: if the data has an adversarial nature, or some very odd variability, which I won't go through here -- we have tried some examples -- be very careful. It may win or it may not. So for those of you who are interested in these kinds of problems, security problems, pay attention to the new Kaggle competition; when the results come out, you'll probably learn something. I could be wrong, okay? I'm actually curious. I worked on this with some very strong people and we didn't get any gain, and if anybody gets better results, that will be very interesting. I could be wrong; I'm very open to it. But so far I am not wrong yet -- that's before the competition result comes out. Now, issues for near-future deep learning; I think this is my last slide. For the perceptual tasks I showed you earlier -- speech, image, video, gesture, etc. -- there are some very important issues. Everybody in this community asks these questions, and none of them has an answer; I'm just throwing them out to make you think, and they are all issues I think a lot about myself. Every time I go to a conference, I look for answers; when a paper addresses any of these issues, it is interesting, and it could be surprising. With supervised data -- speech has a lot of supervised data, and images have a lot of supervised data -- the issue is: what will be the limit of the growing accuracy with respect to increasing amounts of labeled data? So far that limit hasn't been reached. Speech has sort of reached it: we can see that if you get more data, you don't get as much gain. For images, we haven't seen the limit yet; you get more data, you still get better results. So we don't know where the limit is.
Now, once we reach the limit, we want to ask: beyond this limit, when labeled data, which is typically very expensive to collect, becomes uneconomical for a company to keep collecting, will novel and effective unsupervised machine learning emerge, and what will it look like? Will it be deep generative models or some other kind of model? Many people are working on this from an academic viewpoint, so when you go to an academic conference and you see people doing unsupervised learning and generative models, don't dismiss them, even though they haven't shown great results compared with the supervised learning I showed you earlier. Every time I go to a conference, I go to those sessions rather than the sessions on things I already know, because they might show good results after the limit is reached, and the limit will probably be reached within two or three years. That's my expectation. Now, for cognitive tasks, including natural language, reasoning, and knowledge, of which I showed you a little on natural language, the issue is whether supervised deep learning, in machine translation for example, will beat the non-deep-learning state of the art the way it did in speech and image recognition. So far, for machine translation, there's a beautiful piece of work that came out of Google about two or three months ago, from one of the conferences; just a beautiful piece of work. They matched the best result in the competition, but they haven't managed to beat it yet. Every time we get together at a conference, they tell me they've got a better result; I never quite trust that, and so far it hasn't been shown to beat the state of the art, so we don't know whether it will do as well here, because the perceptual problems are easier tasks than this. The cognitive tasks are the higher-level brain activity, and perception is the lower level. Another issue is how to distill distant supervision signals for supervised deep learning. With the DSSM example, we've already taken many of the low-hanging fruits, and the question is what the best way is to exploit this kind of distilled information; web search belongs to this class of problems. You have to think hard about how to get that information into your system, and I gave you the DSSM example of how to exploit it; that was the low-hanging fruit. Once you realize that the collected click information will help you increase your numerator and decrease your denominator, and you know how to do that, you get a lot of bang out of everything. That's a relatively easy problem, and as a matter of fact our deep learning engines are pretty much the same; we just feed in different supervision signals, and we obtain the objective function in the same way. The next question is whether dense vector embeddings -- I showed you this earlier without using the word; embedding means you take the symbols and load them into a vector, a distributed representation -- will be sufficient for language problems. I personally believe, based on some of the literature I have seen, that the answer could quite likely be yes, but the dimensionality has to be very big.
And the alternative question is whether we need to directly encode and decode the syntactic and semantic structure of language. Until about six months ago, my feeling was yes: we somehow need to be able to recover the structure so that you can do reasoning. After reading these recent papers from Google and Facebook, a few just beautiful sets of papers, I am less sure; we haven't yet had a chance to do research in this area ourselves.
>>: Your language translation has to do that, for example.
>> Li Deng: No, no.
>>: You don't have to recover the semantics and then reconvert them?
>> Li Deng: No, a recurrent network just carries all of that information through, and that's the model I mentioned. This is amazing: the first time we read the paper, we said that's too good to be true. It turned out to be true. But anyway, that's a separate question. Now, for big data analytic tasks, which I don't have time to go through in detail, the question is whether vector embedding is sufficient, as it seems to be in language; it's not clear. Should the embedding method be different from the ones used for language? Not clear. What are the right distant supervision signals to exploit for these kinds of tasks? We don't know. There are also data privacy issues, and whether we need to do this kind of analysis with the embeddings encrypted. Basically, when I say near future, I mean two or three years from now; people will go through all these issues, and I expect half of these questions to be answered within the next two or three years. So that's the end of my talk. I'm open and ready to answer any questions, and I've given you a list of selected publications from our group and from Microsoft; it's only a selected list. Thank you very much. Yes, question.
>>: In one of the slides where you showed the deep neural network architecture, the dimensionality is continuously decreasing, from 100 million to 500K; it's always decreasing. From my experience with some image classification work, in the first layer, if you do an expansion and then go down --
>> Li Deng: That might work. That's called a bottleneck feature, and for images that may be true. That may be true. But if you read the --
>>: I've seen the current results with that.
>> Li Deng: But if you read the recent competition papers, the one by Google and the one by our Microsoft Research Asia people, it's just the standard convolutional network: you do convolution, you do pooling, convolution and then pooling again, which does give you --
>>: I think the same effect could possibly be replicated by two layers, where you end up projecting up and then going down?
>> Li Deng: I don't believe so, because this task has been attempted by dozens and dozens of groups. They have been doing it since 2012, for three years, and I have never seen any architecture that looks like what you describe. I think what you should do is actually try your architecture on one of the ImageNet tasks; if it gives you better results, you will be very famous. Yes.
>>: What are the prospects for regression-type problems? All of the things you mentioned are more classification-type.
>> Li Deng: Classification?
>>: Yes, more regression-type problems.
>> Li Deng: Regression problems? Well, the ranking problem, the DSSM I talked about, is actually a ranking problem; it's not really regression, it's called a ranking problem.
Now, for a regression problem, this works just as well; the only difference is at the top layer, where rather than a softmax you use a mean squared error output.
>>: But are there any dramatically good results?
>> Li Deng: So far we don't have any competition results for regression; if there were, I'm sure you would get them. Yes.
>>: So a neural network needs a lot of data. What's a lower bound on the data? Like how many images?
>> Li Deng: Okay, so ImageNet is about 1.2 million images, which used to be considered big, but for speech that's very small. For speech, the data that most people use runs up to about a billion samples, and that's about average for the speech systems we have seen. There are some other tests with more data, and the more the better. But the point is that at about 1.2 million images, deep learning already showed a dramatic advantage, so I expect the threshold is lower than 1 million. And in some of the tasks I mentioned, like the Kaggle competition for the chemical or drug discovery task, the number is only about 40,000 or so, and deep learning did better than anything else, but you have to do a little more work on how to regularize. If you don't have a lot of data, you have to do more work; the [indiscernible] machine does more work.
>>: Okay, just one general question. For the training, my understanding is that for images there can be all kinds of variation, like paging, scanning, translation, distortion, all kinds of stuff, and for speech it can be different dialects or different noise. How can we know when the model is overfitting, or for example, is --
>> Li Deng: I know. So convolution, for example, in images only deals with shift, which is good, but you're talking about many other kinds of variation, and that's why you have many of what are called feature maps, not just a single one. If you have one single feature map, you can only deal with one type of distortion; typically you have somewhere around 20 to 200, and the hope is that because the different maps are initialized differently, they will capture different kinds of variability. That is probably the answer on variability. As for overfitting: if you have lots of data, overfitting may not be a problem. If you have small enough data, overfitting becomes serious, and there are many techniques developed in the deep learning community to address it, like the dropout technique. If you have a small amount of data, always use dropout, and always use pretraining; those are [indiscernible], using those types of models to initialize the network. Then you have partially solved the overfitting problem, and when you do back-propagation, you tend to have fewer overfitting problems. That would be the right answer.
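A minimal sketch of the dropout idea just mentioned, under illustrative assumptions (toy layer sizes, a single hidden layer, inverted-dropout scaling); it shows only the masking mechanics, not a full training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, rate=0.5, training=True):
    """Inverted dropout: randomly zero units during training and rescale so the
    expected activation stays the same; do nothing at test time."""
    if not training or rate == 0.0:
        return h
    mask = (rng.random(h.shape) >= rate).astype(h.dtype)
    return h * mask / (1.0 - rate)

# Toy hidden layer; sizes are illustrative.
W = rng.normal(0, 0.05, (64, 128))
x = rng.normal(size=(8, 64))              # a small batch, as in a small-data setting

h_train = dropout(np.maximum(0.0, x @ W), rate=0.5, training=True)   # regularized pass
h_test = dropout(np.maximum(0.0, x @ W), training=False)             # full network at test time
print(h_train.shape, h_test.shape)
```

Pretraining, the other suggestion above, would amount to initializing W from a model trained without labels instead of from random noise before the supervised step.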
>>: Okay, thank you.
>>: And we have the last one.
>> Li Deng: Yes, question.
>>: Yes. You mentioned GPUs as one of the enabling factors, and clearly for training there have been huge resources thrown at these problems. But one of the other trends is that we're seeing companies like NVIDIA put more GPU technology into portable devices, and I'm wondering how that plays out. Are portable devices doing the classification, doing the speech recognition, or do we always have to maintain a connection to a larger cloud? Does GPU power in a portable device buy it --
>> Li Deng: Most of what I described is cloud-based voice search: all the computing is done in the cloud, and the result gets transferred back, so you really don't have to put all of that on the device. That said, I have seen a lot of startup companies in speech recognition, especially in China, that build an embedded deep learning model on the device. They give up some accuracy, but as long as they find the right application scenario, losing some accuracy is okay. So there is a balance, but in the US, most of the devices I have seen actually use the cloud.
>>: But looking forward, do you see that shifting?
>> Li Deng: I don't see it in this country, no. I really don't, because the transmission overhead is so low and the bandwidth is large enough.
>>: And we'll have 5G in a few years.
>> Li Deng: Yes, so I personally don't see it.
>> Zhengyou Zhang: Okay, thank you very much.