>> Bob Moses: Thanks for coming tonight. Welcome to the December meeting of the Audio
Engineering Society. Tonight we've got Ivan Tashev talking about the Kinect, which is kind of
exciting. I read last night online that in the first month it sold 2 1/2 million units, and the iPad
sold 1 million. So 2 1/2 more -- times more successful than the iPad already. That's probably
one of the most successful consumer products ever sold at that kind of a rate. So it's exciting to
have one of the architects of the device here and look at how it works. I know there's people out
there that want to learn how to hack it. If you want to know that, you have to somehow get Ivan
a little bit tipsy or something.
We have a -- kind of an idea of what we might do next month just in the way of the January
meeting. We're trying to get a guy named Dave Hill to swing through here on his way home
from NAMM. If you know who Dave Hill is, he's -- how would you put it? -- a boutique audio
designer. He has a company called Crane Song that makes real interesting signal processing
equipment. They just announced a relationship with Avid and they're doing stuff for Pro Tools
now. And he also designs stuff for some of the other very high-end boutique microphone
preamp companies.
Very interesting guy. I see him every time I go out to Musikmesse in Frankfurt in the middle of
February. He's always wearing leather shorts and Hawaiian shirts in the winter of Frankfurt,
Germany. He's a fun guy.
So we hope that's going to happen. That's not solid yet, but that's what we're trying to put
together. So stay tuned to our Web site, and we'll put something fun together for next month.
A couple of housekeeping things. We need to know how many AES members we have in the
room. And we don't have Gary here to count, so do you want to do that, Rick?
>> Rick: Ten.
>> Bob Moses: Ten. Thank you. And want to count the --
>> Rick: Yeah.
>> Bob Moses: -- the rest? For those of you who are not members, we'd love to have you join
our association. The AES, if you're not familiar with the organization, is a professional society
for audio people. All kinds of people, recording engineers, product designers, acoustics,
researchers, and so on. It's been around for over 60 years now. A lot of the really important
audio developments in the history of audio have been sort of nurtured and incubated within the
AES, you know, the development of stereophonic sound and the compact disk and MP3 and so
on. A lot of that initial research was presented and discussed within the AES.
And there's a lot of really interesting people. And I can speak from my own experience, it's been
kind of the backbone of my own career, starting companies, raising funding, recruiting
employees, getting the message out, AES has always been there supporting me in everything I've
done.
And we have a lot of fun in these meetings too. So I encourage you to consider joining if you're
not already a member. Speak to any of us during the break or after the meeting if you're curious
about the organization and what it's all about. The Web site is on the -- our local Web site is that
URL on the sign there. If you just go to aes.org, that's the international Web site. You can read
more about it there too.
Let's see. If you didn't sign the little sheet in the back of the room when you came in, you might
want to do that. That gets you on our mailing list. It also makes you eligible for the door prizes
that go out tonight. And we've got some Microsoft software and some other goodies to give out.
And, Rick, you wanted to make a plea on behalf of Mr. Gudgel?
>> Rick: Yeah. MidNite Solar, manufacturer of off-the-grid energy management systems, is
looking for an engineer type who has embedded systems and hopefully DSP experience. So if
this is you or you know somebody who fits that, get in touch with me and I'll put you in touch
with Robin. Or you can grab -- you can go to Google and look for MidNite Solar. You'll find
that they're a startup.
>> Bob Moses: Cool. Also, just keep an eye on our Web site and the national Web site for other job openings; one of the things AES does is advertise job openings and help people out.
At this point we usually go around the room and introduce ourselves, just a quick what's your
name, what do you do with audio, what's your interests; just 50 words or less.
I'll start. I'm Bob Moses. I'm the chair of the local AES section. I work for a company with a
funny name called THAT Corporation designing semiconductors for audio. And I'll leave it at
that.
>> Ivan Tashev: Good evening. I'm Ivan Tashev. I'm the speaker tonight. Work here in
Microsoft Research doing audio software and member of the Pacific Northwest community.
>>: My name is Rob Bowman [phonetic]. I'm a product engineer. I work here doing headsets
and Web cam audio.
>>: I'm Travis Abel [phonetic]. I'm not a member yet. But I'm into composing and engineering
and drumming. So I'm kind of everything all together here.
>> Bob Moses: Welcome.
>>: Hi, I'm Trina Sellers. I'm a personal manager for Joey Marguerite [inaudible].
>> Joey Marguerite: I'm Joey Marguerite. I'm a recording artist and a jazz, soul/jazz singer and
songwriter. I have a production company called RooGal Productions. I've worked in radio and done some on-air talent and engineering and live engineering for a number of years as well.
>>: Rick Shannon [phonetic]. I'm the vice chair on the section. I'm the Webmaster, live sound
engineer, recording engineer, anything audio. No ladders, no lights, no video.
>>: Ryan Ruth [phonetic]. Right now recording engineer and live sound engineer.
>>: Gary Beebe with Broadcast Supply Worldwide in Tacoma, audio salesman.
>>: [inaudible] I work here in Microsoft Research [inaudible] speech signal processing
associated with acoustic [inaudible].
>>: I'm Ken Kalin [phonetic]. I'm recently laid off from a medical device company [inaudible]
for 18 years and now I'm looking at contracting at Physio-Control. I got an e-mail from Bob
Smith [inaudible] and sounded like an interesting thing because audio has always been a hobby.
>>: My name is Matt Rooney. I work here at Microsoft, the GM of mobile games at Microsoft
Game Studios, and before that, about 20 years ago, started on video games doing audio and DSP
for games.
>>: My name's Scott Mirins [phonetic]. I work at Motorola, pretty soon to be Motorola
Mobility, so in the cell phone division and do audio software and that type of thing [inaudible]
stuff like that with phones.
>>: My name is Peter Borsher [phonetic]. I work at Microsoft in the Manufacturing Test Group
[inaudible] hardware design engineer.
>>: I'm Dan Hinksbury [phonetic]. I work with Peter in manufacturing tests, most recently on
Kinect.
>>: Mike [inaudible]. I'm just a guest here checking it out.
>>: Christopher Overstreet, pianist, composer, and private researcher of just really interactive
systems, mapping gestures to other types of output mediums: 3D audio and video [inaudible].
>>: Brian Willoughby [phonetic]. I do software and firmware and hardware design. Currently
working on a multitouch controller for controlling audio [inaudible].
>>: Steve Wilkins. I've worked in psychophysical acoustics in a consulting firm. I'm a
musician. I have a studio and I have a bunch of stuff in the can if anybody wants weird sound
for games.
>>: Adam Croft, primarily film sound, but I've got a background in live [inaudible].
>>: And Greg Duckett. I'm the director of R&D engineering at Rane Corporation. We're a -- we design and manufacture professional audio equipment up in Mukilteo.
>>: I'm Dan Mortensen. I do live sound for concerts and on the local committee. And if you
want to shake the hand of somebody who worked with Miles Davis and Thelonius Monk and
Barbra Streisand, then at the intermission here you should come up and talk to...
>>: Frank Laico, recording engineer.
>>: Steve [phonetic], [inaudible] engineer, do certain board design, former chair of the
committee.
>>: Lou Comineck [phonetic]. I'm a broadcast video engineer. I do -- specialize in live
multicamera broadcasts of football games, baseball games, the Olympics. I work for the major
networks, ABC, ESPN, so on and so forth.
>>: My name is Chris. I'm an electrical engineer. I had a recording studio for a while, but now
I build robots right around the corner from you guys at Mukilteo. And in the evening I've been
writing songs [inaudible] I've got to stop because [inaudible].
>>: [inaudible]
>>: Chris Brockett. I'm with a Natural Language Processing Group here at Microsoft Research.
>> Bob Moses: Very cool. It's always a fun crowd of people. And do go talk to Frank during
intermission, because he discovered Miles Davis. And it's an amazing group of people in AES.
So tonight's guest speaker is Ivan Tashev. Ivan has his master's and Ph.D. degrees from the University of Sofia in Bulgaria. He works here at Microsoft, Microsoft Research, doing audio
and acoustics research. I'm not going to read all this off the Web page here. He was a primary
architect of the sound subsystem in the Kinect device. And I got to see a demo in his lab about a
month ago, and it was really, really cool.
The coolest thing is Dr. John Vanderkooy, who is the esteemed editor of the Audio Engineering
Society journal, one of the most revered scientists on the planet, was dancing to some hip-hop
music, and I got it on my high-definition here. So that was a real treat.
So thank you, Ivan, and thank you Microsoft for hosting us tonight and then giving us this really
interesting preview of what's going on in the Kinect device.
[applause]
>> Ivan Tashev: So good evening. Thanks for coming to this talk. Before I even start, I want to apologize, first of all, for my heavy accent. Here in Microsoft and even in Redmond, more than one half of us are not born in the U.S., so using this broken English is kind of the norm. But I have seen some puzzled faces when I spoke with people who are outside of Microsoft.
Second thing is that during the talk we will see a lot of mathematics and equations. If you don't want them, just skip them; try to get the gist of what is going on in the presentation of the sound, how computers see it and how they process the sound.
It's not intended to be a heavy digital signal processing course; it's just for those who have seen
those mathematics.
First, a couple of words about where we are. We're in Building 99 of Microsoft Corporation. This is the headquarters of Microsoft Research, the research wing of the corporation.
Microsoft Research was established in 1991. This is the year when the revenue of our company
exceeded $1 billion. From all of our sister companies with similar revenue, none of them created
their own research organization. Not one of them is alive today. We claim here in this building that Microsoft Research, with its 850 researchers, creates the stack of technologies which the company needs when it is necessary. We bring the agility this company needs to survive this very fast and very quickly changing world.
We don't do products here. We cannot say, okay, we ship to Windows or Office. What we provide is pieces of technologies, algorithms, approaches, which help to make Microsoft products better.
[inaudible] to continue and to speak, I will show you a 60-second video.
[video playing]
>> Ivan Tashev: So today we'll talk about the sound, we will see what we can do to remove the
unwanted signals from one single audio channel, what we can do if we have multiple
microphones, can we combine them in a proper way to remove the things we don't want, and
we'll talk about some basic algorithms which we use, and we'll end up with some applications.
And at the end I'll basically just let the Kinect device run with some of the games I brought with me and let you guys try it.
And at some point somewhere here we have a break, and most probably we will start to do some gaming, to basically get hands on the technology.
Now, sound capturing and processing. So the point of view from which we are going to look at this process is kind of different from how most of the other engineers do. For them, these are the microphones, and they go and record with the highest possible quality. They have the freedom of a professional setup where you can put the microphones in a specific way to make the recording sound better.
In computer world, and this computer world includes mobile phones, includes your personal
computer at home, your laptop, the Kinect device or anything else in your media room, the
speakerphone in your conference room, usually first we don't have the freedom to do the
professional setup of the microphones, and this requires a heavy processing of the audio signal,
so we can remove the noises, reverberation, et cetera, et cetera, and get some acceptable sound
quality.
And because in engineering nothing is free, this processing itself introduces certain distortions in the audio signal. So it's always a question of balance: with more processing we get rid of more of the unwanted signal, but we also introduce more distortion.
But in general, sound capturing and processing from this point of view has all of those three
categories. It's a science. We work with mathematical models, with statistical parameters of the
signals. There is a heavy mathematics in deriving those equations. And we have repeatable
results. We use the same input always and there is the same output. So this is the science side of it.
But in the end the consumer of this processed sound is the human, with his own ear and his own brain. And once we go to the human perception, this is already an art. Because it's very difficult to explain in equations what humans will or will not like after we do some heavy processing.
And of course it's a craft. As most people doing audio capturing and recording know, you always have some tricks and some stuff which you put in your processing algorithms so that it sounds a little bit better than the same algorithm from the competing lab.
Okay. So in computing we have mostly two major consumers of this captured sound. First is the
telecommunications. Starting from the mobile phone, desktop PCs, we run software like Skype,
like Microsoft Office Communicator, Windows Live Messenger, Google Chat, you name it.
And the second is speech recognition. This is something very specific for the computer world,
but speech recognition is an important component of the interfaces between the human and the machine.
And it has its own specifics, which we're not going to go deep here today.
So in general the discussion of those basic algorithms is pretty much the meat of the talk today. And we'll just show some aspects of building the end-to-end system. But the principle here is that a chain of optimal blocks, each one of them optimal in a certain sense, is usually not optimal overall; it is suboptimal. So you have to tweak and tune those blocks together, aligned as they are connected in the processing chain.
So let's talk about the sound and the sound capturing devices. From a physical point of view, the
sound is nothing else but moving compressions and rarefactions of the air. They usually move
straight and the sound has several characteristics. The first is the wavelengths, the distance
between -- the closest distance between two points with equal pressure.
And it has its own frequency: if you stand here, how many of those will pass this point in one second. These important properties, frequency and wavelength, are connected by a constant called the speed of sound, which in air is 342 meters per second at 20 degrees centigrade.
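(As a worked relation, not on a slide, just for reference: the wavelength is the speed of sound divided by the frequency.)

```latex
\lambda = \frac{c}{f}, \qquad
\lambda(100\,\mathrm{Hz}) = \frac{342\ \mathrm{m/s}}{100\ \mathrm{Hz}} \approx 3.4\ \mathrm{m}, \quad
\lambda(1\,\mathrm{kHz}) \approx 34\ \mathrm{cm}, \quad
\lambda(10\,\mathrm{kHz}) \approx 3.4\ \mathrm{cm}.
```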
This speed of sound is pretty much anything but constant. You change the atmospheric pressure, the speed of sound changes. You change the temperature, the speed of sound changes. You change the proportions of the gases in the atmosphere, the speed of sound changes.
So it varies and it goes down to 330 meters per second at the freezing point of zero centigrades.
But every single wave will have those parameters. So intensity is how much the atmospheric pressure changes, and usually it is measured on a logarithmic scale. So we have to have 0 dB, a reference. And the reference is selected to be 10 to the minus 12 watts per square meter: the amount of energy we get through one square meter is 10 to the minus 12 watts.
If we expose one square meter to sunlight, you get around 500 to 1,000 watts. So the
first conclusion here is that sound is extremely, extremely low-energy phenomenon.
Second, while intensity counts the power which goes through a given surface, we can also use the sound pressure level, which is measured directly in changes of the atmospheric pressure. And again, because it is a logarithmic scale, 20 micropascals of pressure is selected as a reference. This is the threshold of hearing of the human ear at around 1,000 hertz. So 0 dB is a sound at 1,000 hertz which we can barely hear.
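(In equation form, the two logarithmic measures just described, with the reference values quoted above, would be:)

```latex
L_I = 10 \log_{10}\frac{I}{I_0},\quad I_0 = 10^{-12}\ \mathrm{W/m^2};
\qquad
L_p = 20 \log_{10}\frac{p}{p_0},\quad p_0 = 20\ \mu\mathrm{Pa}.
```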
We're not going to discuss any aspects of human psychology, psychoacoustics and how humans
perceive this sound. That's pretty much a subject for a separate talk. But this is just to set the reference here.
So the propagation of the sound across the air is not free. We have energy losses. First, of course, with the distance, because the same energy spreads over a growing surface: if you increase the distance twice, you have four times less energy through the same surface. And that's the famous inverse square law.
Besides this, the compression and rarefaction are related to losses because of the shrinking and expansion of the gas. And this is pretty much the attenuation of the sound, which is a function of the frequency, in decibels per meter.
And what we can see here is that up to 20,000 hertz, which is as high as we can hear, it's negligible. For one meter it's 0.6 dB, and once we go one meter further, we lose much more because of the distance. So this is why in most of the mathematical estimations we can assume that the sound propagates in the atmosphere without losses. This is valid for this room. This is not valid if you try to do sound propagation over a couple of miles or some other range. But for most of the normal conditions when we use sound as an information medium, that's true.
And just a couple of words about another phenomenon specific to sound. This is that the sound, besides being a very low-energy phenomenon, is very broadband. So humans are considered to be able to hear between 20 and 20,000 hertz. This is a 1,000 times difference in frequency. For comparison, in visible light, the low-frequency red light has a wavelength around 720 nanometers, and the deep blue is around 340. Which means all the colors we can see actually span less than a two times difference in wavelength. And, on the other hand, in sound we have a 1,000 times difference in frequencies.
And there is one more thing. The wavelength of the visible light is measured in nanometers. This is 10 to the minus 9th, which means that every single object around is way, way, way bigger than the wavelength. So the light propagates in straight lines. We have perfect shadows. You can see my shadow. And this is definitely not the case for audio. The audio is surrounding us. Objects are comparable with its wavelength.
One hundred hertz is 3.4 meters. 1,000 hertz is 34 centimeters. 10,000 hertz is 3.4 centimeters. So not only do we have objects which are comparable with the wavelength, but simultaneously, in the same sound we use, we will see sound which behaves pretty much like visible light. For 10,000 hertz, this thing will cast a clear shadow, and a 10,000-hertz sound source will not be hearable here.
On the other hand, a hundred hertz will go around this desk without any problems. So we face all of those diffraction effects during the transmission of the same sound, because our environment has objects of sizes comparable to the wavelength; in some cases they will behave pretty much as they do with light, and in some cases they are acoustically transparent.
And of course you can see the intervals of frequencies different animals can hear. So human hearing is between 20 and 20,000. And you can see that we are actually pretty good listeners. Better than us are -- the common house spider, I think, is what is missing here.
An interesting case is right here. The bat can hear between a thousand and a hundred thousand hertz, and emits signals at 40-, 45,000 hertz. And this is how the animal basically can do the localization and detection of the objects in front of it.
And the other case is the whales. They can do this too, and they have this huge hearing range in water. They have way, way worse hearing in air. And the reason for this is that it is very difficult to match the acoustical impedance of the air and the water.
And just for information, what it means for a sound to be loud: 0 dB is the threshold of hearing.
You drop a pin, it's around 10 dB. In a quiet library, you have around 40 dB. That's a very quiet
sound. Here our sound pressure level when I do not speak is around 50 dB SPL. And it goes
higher and higher. Then a machinery shop is around 100 dB, which is close to the
threshold of pain and it's extremely unhealthy, and it goes up to 170 dB for the Space Shuttle
launchpad.
This is for the set of frequencies we can hear. Otherwise, for those which we cannot -- let's say low frequencies -- I can give you an example. A passing cloud above us generates a sound of 170 to 190 dB SPL. But the frequency is way, way below; at one hertz it is more changes in the atmospheric pressure, and it's a subject of meteorology, not a subject of acoustics.
So how are we going to get the sound -- those microscopic changes in the atmospheric pressure -- and convert it to an electrical signal? That is the microphones. All of the microphones use some physical effect to convert small movements of a diaphragm back and forth into an electrical signal. This can be a crystal, which is based on the piezoelectric effect: we have two small pairs of crystals connected to the diaphragm, and as it moves back and forth because of the changes of the atmospheric pressure, we have an electrical signal.
This can be a dynamic, which is pretty much an inverted loudspeaker: we have a coil which moves inside, driven by the diaphragm. And a condenser, which is pretty much two plates; one of them moves and this changes the capacitance, and somehow we convert this into an electrical signal.
So we're not going to go in detail how the movements of the diaphragm are converted to an
electrical signal, but what matters here is actually that we have to have the diaphragm. So pretty
much the microphone is nothing else but a closed capsule with a diaphragm. Closed capsule
means the pressure here is constant. And when it changes on the outside, the diaphragm starts to
bend back and forth.
And if the size of this microphone is smaller than the wavelength of the sound we want to capture, then we can say that we capture the changes in the atmospheric pressure at a point in space. So this is an acoustical monopole. On top of this, this microphone pretty much reacts to the sound in the same way regardless of the direction it came from, because those are changes in the atmospheric pressure.
Completely different case we have if we have a small pipe and the diaphragm in the middle. So
this is called acoustical dipole and has a very specific behavior when the sound comes from
different direction.
So if we have a sound coming from 90 degrees, the soundwave, the changes in the pressure, comes here, and we have the same pressure on both sides of the diaphragm. It doesn't move.
If the sound comes along the axis of this pipe, first it will hit the front, and with some small delay it will hit the back. This means that we'll have a difference, and it will react. So this is why the directivity pattern of this dipole is a figure eight. At plus/minus 90 degrees we don't have sensitivity: even if we have a loud sound, it is not registered. But if it comes along the axis, the axis of the microphone, we have maximum sensitivity.
Unfortunately, this microphone doesn't do anything else but measure the difference in the atmospheric pressure here and here, so pretty much it takes the first derivative of the atmospheric pressure.
As a result, we have a not very pleasant frequency response. If the signal is at a very low frequency, pretty much the same thing happens on both sides and our microphone is not very sensitive.
Once we start to have a higher frequency, the sensitivity goes up. But, still, this pressure gradient microphone is an important component of microphones. And so far we assume that the microphone is in a so-called far field sound field, which means the sound is somewhere very, very far away and the magnitudes on both sides of the microphone are the same.
If we have a sound source which is closer, what happens is that we have a certain sound path here, and the sound has to travel this distance, which means that we already have a comparable difference in the sound pressure level here and here just because of the difference of the sound paths.
And then the microphone starts to behave a little bit differently. Now, of course we don't want that 6 dB-per-octave slope towards the higher frequencies, so we have to compensate somehow. And in a simple way, we can compensate by adding one single RC group which has the opposite frequency response, so we can have a flat frequency response.
And if you compensate for far field, as you can see in pretty much the entire range, we'll get our
flat frequency response. So let me reiterate this. If you will turn back here, this frequency
response compensated with one RC group, we can flatten it, and then for far field sounds, we'll
have the green line here.
But then when we have a sound source which is closer, so we grab the microphone, let's say from this distance, and I start to approach it, then the effect of the difference will actually -- what's this? This is not me. Anyway, it will basically have the effect that we did overcompensation.
And the basses --
>>: It could be somebody outside wanting in.
>> Ivan Tashev: Doesn't matter. So what happens is that for a closer distance, let's say 2 centimeters, when the microphone is a close-up microphone, we will have overcompensation. Okay. If we have compensated for a microphone which is close to our mouths, meaning we straighten this line, then for far field sounds we will have the basses basically going down, as we call it.
So what happens when singers hold the microphone, which is usually compensated for 20 centimeters or so, and they do this: you see how the basses get basically enforced, and you hear a way, way different voice.
Or if you have a certain microphone, a directional microphone, on your headset right here, then suddenly all the sounds, especially in the lower part where the noises are, start to disappear. So this is called a noise canceling microphone. It's nothing else but a properly compensated figure eight microphone.
And I think this is called the proximity effect. Singers have actually used this: by changing the distance to the microphone, what happens is they move the frequency response up and down. It's kind of an on-the-fly control of the basses.
So this is how the microphones look inside. This is not a microphone; this is a microphone capsule inside an enclosure. And inside usually we have something like this. This is an omnidirectional microphone: the capsule closed with the diaphragm in front. You can see some small, small holes there, but those are just to compensate for the changes of the atmospheric pressure. And the directional microphones already have well, well visible holes on the back. So it's a kind of small pipe, a small pipe with the diaphragm in front.
In general, a combination between the directivity pattern of an omnidirectional microphone and a figure eight microphone in a certain proportion alpha, which is between 0 and 1, can bring us a large variety of directivity patterns for the microphones. And this is where we have our cardioid, supercardioid, hypercardioid, and figure eight microphones.
So if this alpha, which is what portion of the omnidirectional microphone we have, is equal to 0 -- let me turn back to show you the equation. So we're talking about this alpha. If this alpha is 1, the directivity pattern is constant; this is the omnidirectional microphone. If this alpha goes down to 0, this term is 0 and we just have the cosine of the angle; this is the figure eight directivity pattern. And with something in between we can reach those different directivity patterns.
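(The equation on the slide is not reproduced in the transcript; the standard first-order directivity pattern that matches this description would be:)

```latex
B(\theta) = \alpha + (1-\alpha)\cos\theta,
\qquad
\alpha = 1\ \text{(omnidirectional)},\quad
\alpha \approx 0.5\ \text{(cardioid)},\quad
\alpha = 0\ \text{(figure eight)}.
```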
Each one of them has a name because of a certain specific property. The omnidirectional microphone is well known because it captures the sound from everywhere. The cardioid microphone is known to completely suppress the sounds coming from the opposite direction. The supercardioid microphone has the highest front-back ratio, which means the ratio between the energy it captures from the front half and the energy it captures from the back half is the highest. This makes it a very good microphone to be placed on the edge of the theatrical stage, to capture the artists and not the coughing of the public.
The hypercardioid microphone is known to have the highest directivity index, which means it suppresses the maximum amount of noise coming from all directions. And there used to be times when those directional microphones actually had two microphones in the enclosure; you get the two signals separately, and you can actually change the directivity pattern of the microphone on the fly. That's a long, long time ago. Currently we just have capsules which have a specific directivity pattern.
Of course those were theoretical derivations of the directivity patterns. Once we start to deal with real microphones, we start to see some different things. For example, this is a catalog directivity pattern of one of the microphones -- the directivity of a cardioid microphone.
And you see the manufacturer says, okay, it's flat, and down to around 300 hertz it starts to go down. At 180 degrees we have pretty much 15 dB separation. So the cardioid microphone doesn't go to 0 in the back, but 15 dB is actually pretty decent directivity.
When we get this microphone and actually measure it in an anechoic chamber, what happens is that first we see that it's not quite uniform. For certain frequencies it has a tail, so it's a hypercardioid; for others it's a pure cardioid, et cetera. So it is not uniform across the frequencies. And at 0 degrees the frequency response is not quite flat; we have some dB going up and down.
But regardless of this, it is a relatively good match between what the specification says and the parameters the actual microphone, measured in the anechoic chamber, shows us.
And everything we discussed so far was about having a microphone freely hanging in the air, because we can assume that, hanging in the air, the capsule is very small compared to the wavelength we want to capture.
Unfortunately this is not the case when we have the microphones in devices: telephones, speakerphones, or something like the Kinect device. Then things get complex. And the first thing which happens is that we start to see quite an asymmetric directivity pattern. What you actually see here is the directivity pattern of one of the four microphones we have in the Kinect device.
Usually having a microphone around -- placed in an enclosure makes the directivity patterns
worse. The directivity and how much noise we can suppress is going down.
Kinect is one very pleasant exception because in the design and the position and placement of the
microphones we actually use the shape of the enclosure to form the directivity patterns of the
microphones, and actually increase the directivity index, increase the directivity of the
microphones. So we can basically suppress more noise in advance even before we start to do
any digital signal processing.
Okay. But in general to make a simple, short summary, a cardioid microphone has 4.8 dB noise
reduction compared with a single omnidirectional microphone. So we are already 4.8 dB ahead
in suppressing noises and removing the unwanted signals.
So it's highly recommended to use directional microphones if we know and can basically guess
where the speaker, where the sound source, is coming from.
In devices like Kinect it's very obvious because the users are in front. And we do not expect
to have sound sources behind. And if there are, we actually don't want them. Those are usually
reflections from the wall behind.
So this is why all four microphones in Kinect point straight forward and try to suppress the
noises coming from left, right, and back. There's speakers, there are reflections. We don't want
them.
Okay. Now we're ready. Our sound has been converted by the microphone to an electrical signal.
Now let's see what we are dealing with. We'll talk about the noises in the speech signal. We'll
see how computers actually see the sound.
First, noise. What's noise? Let me play some noise.
[audio playing]
>> Ivan Tashev: This unpleasant noise is called white noise because it covers evenly all
frequencies which we can hear, from 20 to 20,000 hertz. It is a mathematical abstraction. We cannot find it in nature. In nature we can find this.
[audio playing]
>> Ivan Tashev: This is inside of a passenger plane, a noisy place in general. But this noise has a different sound because it has a lot more low frequencies, and the frequencies toward the upper part go down.
And of course we can have another unpleasant noise.
[audio playing]
>> Ivan Tashev: Which contains many, many, many, many humans talking together.
Besides the spectrum, the frequencies those noises contain, statistically speaking they are actually quite similar. All of them -- their magnitudes -- can be described with a Gaussian distribution, which is the simplest statistical model, so we can start to play with them.
Of course the white noise has a flat frequency, flat spectrum. Well, the other noises pretty much
go down towards the high frequencies and have a higher magnitude towards the lower part of the
spectrum.
So skip this. Speech. Speech happens to be different in many, many different aspects. It has nothing to do with and is not like the noise. The first thing is that if you try to do a statistical model of the distribution of the magnitudes, it does not look like a Gaussian distribution. This is the blue line. It's way peakier. The speech is mostly around 0, with occasional sharp, basically big magnitudes. So a Gaussian is not quite a good statistical model.
On top of this, the speech signal has parts with completely different statistical parameters. We have vowels, and those, as we will see, have a very good harmonic structure, and we have unvoiced fricatives, the sh and ps sounds, and those are kind of noise-like.
And speech is not constant like the noise; pauses are an integral part of the speech signal, and it is chopped into parts with completely different parameters.
And unlike the noise, which, usually after some reverberation here or there, is kind of spread around, speech sources are usually point sources which we can point to and get gains from.
On top of this, so we discussed that it is not kind of Gaussian, and those are three representations
of three different speech signals. First let me just comment what this means. This is time in
seconds and this is frequency, up 0, down is 8,000 hertz. And this is the scale, the magnitude.
Reddish color means higher magnitude. So what we can see is that when you say sh, we have mostly energies around 3-, 4-, 5,000 hertz, and not much below 1,000 hertz.
On the other hand, it is a -- it's kind of [inaudible] impulse signal. We have a frequency and
some energy here, and the rest is pretty much low magnitude.
And the last one is just a vowel. What we can see is that this signal has a very good harmonic structure. It has a main frequency, which is called the pitch. This is the speed of vibration of our vocal cords. They vibrate kind of like a hammer basically hitting a surface, which means besides the main frequency we have all the harmonics: two times, three times, four times, five times, up to 20 times the frequency. And then this chain of impulses goes through our mouth, where with muscles we change the shape and form this envelope.
And those maximums in the envelope are called formant frequencies. So from now what we can
say is that, okay, we have a maximum around 600 hertz here and a maximum around 2,000 hertz
here. Take a generator of those pulses, play through a speaker, and you will [making noise],
because based on those formant frequencies, we recognize which vowel is pronounced.
Humans have a different pitch. Usually males, they have a lower pitch. Females, they have a
higher pitch. But this shape, the formant frequencies are the same. And this is why [making
noise] is [making noise] regardless of who is saying it.
Okay. So what can we do with the speech signal? We can do noise suppression. Why? Because we saw that the speech signal is quite different from the noise signals. So there is a way to distinguish them and to suppress the noise.
Noise cancellation is something different. Noise cancellation is removing a noise we know.
Classic example is in the car we place a second microphone in the engine compartment. And
this is the noise of the engine. We can try to find how much of it goes to the microphone which
stays in front of the driver and to subtract it. So this is noise cancellation.
More common actually is those noise cancellation headphones in the planes when we have small
microphones on the outside of the ear cups so we can estimate how much of the plane noise
comes inside the cups and subtract it.
In general, all of this class of processing algorithms, which we use to make humans perceive the speech signals better, is called speech enhancement.
And one more processing which we're not going to talk about is active noise cancellation. So
this is what happens in those noise canceling headsets. Nothing stops us -- we have a lot of loudspeakers here -- if we have some low frequency noise going around, from making the loudspeakers send the opposite signal to this low frequency noise. And suddenly we can have a quieter room.
That works up to a certain frequency range, and it's kind of tricky. But it can be used. So we're going
to talk mostly about speech enhancement and we're not going to touch noise canceling
headphones, we're not going to talk about removing the noise from -- let's say from this room.
So once the changes in the atmospheric pressure become an electrical signal, the first process which happens with them is so-called discretization. Computers work with numbers. And this simply means, ignoring this whole math here, that we're going to sample -- to measure the magnitude of the signal -- at certain intervals. Those intervals are called the sampling period. Or, inverted, this gives us the sampling frequency.
And there are several people who left their names here, but one of the most famous is Claude Shannon, who basically said that we should do this sampling with at least two times higher frequency than the highest frequency in the signal we want to sample.
So technically this means that for humans, who can hear up to 20 kilohertz, we should do this with a 40 kilohertz sampling rate. The standard is 44.1 kilohertz, and this is what we use in most of the recording equipment, at least. Professional equipment goes to 48, and we have even higher rates already in computers.
In telephony we actually go down, because sampling the signal more frequently means more numbers to crunch and to process, and because the speech signal goes up to 6, 7 kilohertz, a 16 kilohertz sampling rate is usually considered enough for the speech signal, for speech communication.
Okay. Second major thing which happens with the speech signal once it is sampled is
quantization. So we get our number, the value of the atmospheric pressure or the voltage, but
that thing is basically -- has to be converted to a number. And the numbers in computers tend to
be discrete. So it can be 1, 2, 3, 4, 5, but nothing in between.
And this is why this process of quantization is converting the analog signal, analog value, to a
discrete value. And from the moment we did the sampling and quantization, we now have a
string of numbers. Pretty much this is what computers work with. We have those numbers with
certain frequency which arrive with reduced precision because of the quantization and we don't
know what happens in between, but the sampling theorem of Claude Shannon tells us that it is
okay because we did the sampling with at least two times higher frequency than the highest
frequency we were interested in.
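(A minimal sketch of sampling and quantization in code; the test tone, the 16 kHz rate, and the 16-bit resolution are illustrative choices, not anything from the talk:)

```python
import numpy as np

fs = 16000          # sampling rate in hertz (speech-grade, as discussed)
f0 = 440.0          # an arbitrary test tone
bits = 16           # quantizer resolution

# Sampling: measure the magnitude of the signal every 1/fs seconds.
t = np.arange(0, 0.02, 1.0 / fs)          # 20 milliseconds of samples
x = 0.5 * np.sin(2 * np.pi * f0 * t)      # the "analog" values, in [-1, 1]

# Quantization: round each sample to the nearest of 2**bits discrete levels.
levels = 2 ** (bits - 1)
x_q = np.round(x * levels) / levels

print(len(t), "samples, max quantization error:", np.max(np.abs(x - x_q)))
```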
Now, next thing: computers also kind of don't deal quite well with a stream of numbers. So the next thing that happens is we drop those numbers into packets, and those packets are called audio frames. The size of audio frames is between 80 and 1,000 samples, which typically means between 5 and 25 milliseconds. So we take this piece and then we try to process it together.
The next thing which happens is that this processing happens in the so-called frequency domain. We change the representation of the signals. If we take 1,024 samples, as we sample them from the microphone, this is a function of time. If we use the so-called Fourier transformation, we'll get again 1,024 samples, but they represent the signal in the frequency domain. So for this particular frame, we know at which frequencies we have higher magnitudes or lower magnitudes.
And because this signal goes as a stream, what actually happens is those frames are 50 percent overlapped. You can see here: this is one frame. We convert it to the frequency domain, do whatever we want here, this is the output, and then it is converted back to the time domain. But the next frame is moved 50 percent forward.
And then they are properly weighted so we can combine them in a way that the signal has no breaks and is properly aligned.
So from now on, everything that happens -- and we will be discussing what happens here -- our input for our algorithms is one vector of 256 or 1,024 samples, the spectrum of the signal in the frequency domain; then we do some stuff here, and the rest is the so-called overlap-add process. This is pretty much the standard audio framework for all audio processing algorithms.
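(A minimal sketch of that standard framework: windowed frames, 50 percent overlap, FFT, per-bin processing, inverse FFT, and overlap-add. The frame size and the pass-through process() are placeholders, not the actual Kinect pipeline:)

```python
import numpy as np

def process(spectrum):
    # Placeholder for the per-bin processing (suppression gains, beam weights, ...).
    return spectrum

def stft_overlap_add(x, frame=512):
    hop = frame // 2                                             # 50 percent overlap
    n = np.arange(frame)
    win = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * n / frame))     # sqrt-Hann: analysis * synthesis sums to 1
    y = np.zeros(len(x))
    for start in range(0, len(x) - frame + 1, hop):
        spec = np.fft.rfft(win * x[start:start + frame])         # frame -> frequency domain
        spec = process(spec)                                     # do whatever we want here
        y[start:start + frame] += win * np.fft.irfft(spec, frame)  # back to time and overlap-add
    return y
```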
Okay. Now, what we can do and how we can remove the noise. I'll try to increase the speed here because we will see a lot of equations. But roughly what happens is that we get this vector, which is the representation in the frequency domain of the current audio frame. And that's a mixture between the speech signal and the noise signal.
So what we can do is, if we know how much noise we have -- let's say we get a signal and we know it's 15 percent noise -- we can apply a number, multiply by 0.85, and presumably we'll have just the speech signal left. And if we can do this for each frequency bin -- k here is the frequency bin -- with this real gain, which will be different for every single frame, if we can somehow compute this thing, then the output will have less noise, because at least we remove an equal portion of the noise.
So pretty much this is what happens. We'll heavily use the fact that the speech comes with pauses in between. And we'll use something called a voice activity detector. When the voice activity detector tells us there is no speech, it's just noise, we'll update the noise model. This is our statistical parameter, one per frequency bin, which simply tells us the noise energy for this particular frequency.
And then pretty much on every single frame we'll try to compute this suppression rule, which is that number, usually lower than 1, which we can apply and then convert back to the time domain. So this is pretty much what a noise suppresser does. We're going to skip a substantial amount here.
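(A very rough sketch of the loop just described: an energy-based voice activity decision, a per-bin noise model updated only during pauses, and a real-valued gain per bin. The smoothing constant, the crude VAD threshold, and the Wiener-style gain are illustrative choices, not the actual Kinect suppression rule:)

```python
import numpy as np

class NoiseSuppresser:
    def __init__(self, bins, alpha=0.95, floor=0.1, warmup=10):
        self.noise = np.full(bins, 1e-6)   # per-bin noise power model
        self.alpha = alpha                 # smoothing constant for the noise model update
        self.floor = floor                 # minimum gain, to limit speech distortion
        self.warmup = warmup               # first frames are assumed to be noise only
        self.frame = 0

    def process(self, spec):
        power = np.abs(spec) ** 2
        # Crude VAD: frame energy close to the noise model energy -> treat as a pause.
        is_pause = self.frame < self.warmup or power.sum() < 2.0 * self.noise.sum()
        if is_pause:
            self.noise = self.alpha * self.noise + (1 - self.alpha) * power
        self.frame += 1
        # Per-bin real gain, usually lower than 1: keep the estimated speech fraction.
        snr = np.maximum(power / self.noise - 1.0, 0.0)
        gain = np.maximum(snr / (1.0 + snr), self.floor)
        return gain * spec
```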
Pretty much the suppression rule is kind of complex, and it's not a function just of the current noise; it's a function of the a priori and the a posteriori signal-to-noise ratio. It is something which a lot of smart people worked on for a long time. But this is already considered a classic, a classic digital signal processing algorithm.
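(One classic rule of this family -- not necessarily the one shipped in the product -- is the Wiener gain driven by the decision-directed a priori SNR estimate:)

```latex
\gamma_k = \frac{|X_k|^2}{\lambda_{N,k}}, \qquad
\hat{\xi}_k = \beta\,\frac{|\hat{S}_k^{\text{prev}}|^2}{\lambda_{N,k}}
           + (1-\beta)\max(\gamma_k - 1,\,0), \qquad
H_k = \frac{\hat{\xi}_k}{1+\hat{\xi}_k}, \qquad
\hat{S}_k = H_k X_k,
```

where lambda_N,k is the noise model for bin k, gamma_k is the a posteriori and xi_k the a priori signal-to-noise ratio, and beta is a smoothing constant close to 1.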
And a noise suppresser -- okay, we skip this. A noise suppresser is something which is already not considered something we should spend research time on; you just take that algorithm and use it. And it actually improves a lot. The signals on the output of the noise suppresser sound nice. The noise is gone.
I have to underline here something which may be surprising. If you do very serious testing of understandability -- meaning you have a speech signal plus noise and ask people to do a transcription, and then you do state-of-the-art noise suppression and ask another group of people to do a transcription -- the percentage of errors will be pretty much the same. Nothing can beat the 100 billion neurons between our ears.
Okay. But the difference is that those who listen to the noise-suppressed, processed signal perceive it as better because it loads the brain less. So it reduces the fatigue and, in general, it's considered more friendly to the listener. Instead of us spending our brain power to remove the noise, the computer does this for us.
>>: So what technique of noise suppression did you do for Kinect?
>> Ivan Tashev: We'll see a block diagram, partial block diagram what happens. But, yes, we
have a noise suppresser at some point after the so-called microphone array.
So, now, microphone arrays: what do they do? Usually I compare sound processing with a single microphone to trying to do image processing, or to make a picture, with a camera which has one pixel. This is pretty much it: we sample a pretty complex three-dimensional propagation of the soundwave in one single point.
We don't know where the signal came from. If we place more than one microphone in a certain mutual position which we know, then we can do more interesting stuff. And this is called a microphone array.
We'll say that multiple microphones become a microphone array when we do the processing and try to combine the signals together. This is not a microphone array; this is a stereo microphone. We don't do any processing. We take the two signals and just record them. That's it. We leave the brain to do the processing.
Let's see what we can do with the microphone arrays. So you see if we have -- we have here
multiple microphones. In this case we have four. Here we have eight. In a circle geometry, they
can be placed into a car. We have four microphones here. And having those multiple
microphones allows us to sense where the sound came from.
Why? Well, it's simple. Here is the microphone array. The sound comes from here. It will reach first one of the microphones, then the second, then the third, then the fourth. If the sound came from here, the order would be the opposite. The sound moves pretty slowly from a computer standpoint. This is well, well detectable. The difference between here and here is 7-8 samples at a 16 kilohertz sampling rate, which allows us to have a pretty decent sense of where the sound came from.
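(A minimal two-microphone sketch of that idea: estimate the sample delay by cross-correlation and convert it to an arrival angle. The 16 cm spacing, the 16 kHz rate, and the plain, non-generalized cross-correlation are assumptions for illustration, not the Kinect geometry:)

```python
import numpy as np

def doa_from_pair(x1, x2, fs=16000, spacing=0.16, c=342.0):
    """Estimate the direction of arrival, in degrees from broadside, from two mic signals."""
    corr = np.correlate(x1, x2, mode="full")            # plain cross-correlation
    lag = np.argmax(corr) - (len(x2) - 1)               # delay of x1 relative to x2, in samples
    tau = lag / fs                                      # the same delay in seconds
    sin_theta = np.clip(tau * c / spacing, -1.0, 1.0)   # plane wave: tau = spacing * sin(theta) / c
    return np.degrees(np.arcsin(sin_theta))
```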
The next thing is to know which direction we want to capture. Once we know this, what we can do is the following. For example, if I want to capture Rob's voice, I know the delays of the sound coming from Rob's direction, and then what I can do in the processing phase is to delay these signals: the first microphone to the last one, the second microphone to the last one, the third microphone to the last one.
Once I sum those signals, all signals coming from Rob's direction are in phase and we will have constructive interference.
However, if we have a sound coming from this direction, even after this shifting, the signals are at different phases. And we'll have destructive interference and the level will go down.
So with this simple delay-and-sum processing, I start to have a better directivity towards this direction and start to suppress the sounds coming from other directions. And because, if you remember, the noise is kind of spread around us, but usually the speaker is a point source, the beam separates out the other signals, which are mostly noise, or Rob's voice reflected from the ceiling and from the walls, which is reverberation and something we don't like as well. And now we have a better signal on the output of this microphone array.
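(A sketch of delay-and-sum for one multichannel frame, done in the frequency domain: phase-align each channel toward the chosen direction and average. The uniform 4 cm linear spacing is an assumption for the example; the actual Kinect layout may differ:)

```python
import numpy as np

def delay_and_sum(frames, theta_deg, fs=16000, spacing=0.04, c=342.0):
    """frames: (num_mics, frame_len) time-domain frame; returns the beamformed frame."""
    num_mics, frame_len = frames.shape
    theta = np.radians(theta_deg)
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    spectra = np.fft.rfft(frames, axis=1)
    out = np.zeros(freqs.shape, dtype=complex)
    for m in range(num_mics):
        tau = m * spacing * np.sin(theta) / c                 # delay of mic m for a plane wave
        out += spectra[m] * np.exp(2j * np.pi * freqs * tau)  # undo the delay (phase align)
        # signals from theta add in phase (constructive); other directions partially cancel
    return np.fft.irfft(out / num_mics, frame_len)
```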
So this very simple algorithm, as we can see, has a not quite good directivity pattern. This is direction, and this is frequency. For lower frequencies we have pretty much no directivity. Towards the design direction I have a flat frequency response; I'll capture his voice nice and clear. But for certain frequencies I'll have some areas with [inaudible] sensitivity. But in general it's a high-directivity microphone. Nothing more, nothing less.
And now I will show you this in action in a quick demo. Okay, one, two, three, testing. So we
can see that the device can sense when I'm talking, one, two, three, testing, testing, testing. You
see that line is the sound source localizer, which localizes where the point source is and tries to listen towards that direction, suppressing the noises coming from other directions. One,
two, three, testing, testing. One, two, three. You can see one, two, three, testing, testing.
And we can do something like this. Okay. One, two, three, testing, testing. So I'm let's say
three or four meters away from the device. One, two, three, testing, testing, one, two, three,
testing, one, two, three, testing, testing.
[audio playing]
>> Ivan Tashev: So what is the effect? It shortens the distance. If I have a high directivity, I will capture less noise and less reverberation. And from four meters I will sound the same way as at a closer distance using just one regular microphone. That's pretty much it. It reduces the noise, reduces the reverberation, and shortens the distance, the perceivable distance,
between the speaker and the microphone. Technically this means better quality. Because close
up to the microphone in general means less noise and less reverberation.
One more. Let me see how this will work. So what is going to happen here is I'm going to
record in parallel. Okay. So first I'll mute the microphones, the PA. And then I'll try to record
in parallel, in parallel a signal from one microphone, which is on my laptop, and with the
microphone array. And you will see the difference.
One, two, three, testing, testing, testing.
[audio playing]
>> Ivan Tashev: This is the single microphone in front of me.
[audio playing]
>> Ivan Tashev: And this is the output of the microphone array. So first you see the noise floor is gone. Second, I'll play this again, but you can hear how hollow, how much more reverberant the signal is here. And this sounds like I'm closer. And the distance was absolutely the same. Let's hear this
again.
[audio playing]
>> Ivan Tashev: And if you look at the measured signal-to-noise ratio here, we started at 14 dB
signal-to-noise ratio. This is measured on one of the microphones. And on the output we had 36
dB signal-to-noise ratio, or we did an impressive 22 dB suppression of the noise without hurting
actually much the voice signal.
>>: Now, was there any con- -- not conversation, was there any problem with the low end or
anything? I mean, it sounded great, but what is the -- what's the ramification on the low end?
>> Ivan Tashev: So on the low end, if you mean the lower part of the frequency band, yes. This size of microphone array is not quite efficient for a hundred hertz. For speech you actually care about 200 to 7,000 hertz, which is pretty much sufficient for anything, for telecommunication, for speech recognition. So that is pretty much the low end. Below 200 hertz, just cut it. Don't process it. Don't waste your time. Just remove it.
>>: But it'd have to be twice as big to go to [inaudible]?
>> Ivan Tashev: Yes.
>>: Okay.
>> Ivan Tashev: So technically if you go into the research labs which do audio research, you
will see 20- to 36-element microphone arrays, three or four meters long, et cetera, et cetera. But
Microsoft Research is the research part of a commercial company. And, yes, we do basic research. We enjoy those algorithms. But when the time comes to do some prototyping, we always ask ourselves, hmm, if we will ever want to make this a product, is this size of array reasonable, or should we stick with something smaller.
Okay, where were we. So just a couple of terminology points here about the microphone arrays. This process of combining the signals from the four microphones is called beamforming, because it forms kind of a listening beam towards a given direction -- that is what the green line follows. The algorithm I described, delay-and-sum, is the simplest, the most intuitive, and of course the least efficient algorithm.
By changing the way we mix those four signals, we can actually make the microphone array listen to different directions without any moving parts. It's the same as the sound operator who holds those big directional microphones during the making of movies and moves them from sound source to sound source; we can make the microphone array do this electronically. So this process of changing the beam, the listening direction, is called beamsteering.
We can do even more. We can do so-called nullforming, which means I want to capture you but I don't want to capture Rob, who talks in parallel, so I can mix the microphones in a way to have a beam towards this direction and a null towards that direction, so all the direct path from Rob's voice will be suppressed. Of course there will be some reverberation in the room, and I'll capture a portion of it, but most of the energy will be gone.
And of course we can do the same thing with null steering. For example, I want to capture your voice, but Rob is talking and moving back and forth. I can steer that null to suppress him constantly.
And of course sound source localization is an important part of every microphone array processing, because before pointing the beam we have to know the direction. So with this, as with any microphone array, we have the ability to detect where the sound came from. And actually quite precisely: we're talking about a couple of degrees here. At four meters that is about the size of my mouth. So pretty much I can pinpoint the beam to the mouth of the speaker.
>>: So you could take that into that New York cafe and hone in on somebody's conversation and
you could record all four channels and do it later on however you wanted, right?
>> Ivan Tashev: That will be more difficult to do with four channels, but let's say that we can design a microphone array which can do this. Somebody sent a link to a gigantic microphone array on top of a basketball field with 512 microphones, and with this you can go and listen to what the coach tells the players, you can listen to some conversations between the attendees of the game, and that works quite well.
>>: Takes a lot more microphones to do that well.
>> Ivan Tashev: Yes. Because --
>>: Sounds like you could do pretty good with four.
>> Ivan Tashev: Yeah, but those -- this is -- the distance is up to three or four meters. Beyond
that the reverberation already is too much. And you need a larger array and more microphones
and more processing power.
>>: You need a minimum of four microphones to make this work, or can it be done --
>> Ivan Tashev: Yes. Trust me, if we could do it with three, we'd do it. Every cent is counted in
the cost of goods of the device.
Okay. We will just see two types of beamformers. One of them is called the time-invariant beamformer. This means those mixing coefficients, for each frequency bin and for each channel, we precompute offline for beams every five degrees and store them in the computer memory. And then when the sound source localizer tells us 42.3 degrees, we take the weights, the coefficients, for the beam at 40 degrees, and because five degrees is roughly the beam width, we can pretty much cover the entire space without doing any computation, any serious computation, in real time.
This works if you assume that the noise is isotropic -- spread around us with equal probability
from each direction. But we cannot adapt on the fly, no null steering, when we have a disturbing
sound source.
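A toy sketch of that lookup idea: weights precomputed for beams every five degrees, then simply indexed by the localizer's estimate at run time. The table below is filled with random placeholders rather than real coefficients, and the shapes are assumptions for illustration.

```python
import numpy as np

BEAM_STEP_DEG = 5
N_BINS, N_MICS = 257, 4
beam_angles = np.arange(-90, 91, BEAM_STEP_DEG)
rng = np.random.default_rng(0)
# Offline: one complex weight per (beam, frequency bin, microphone) -- placeholders here
weight_table = (rng.standard_normal((len(beam_angles), N_BINS, N_MICS))
                + 1j * rng.standard_normal((len(beam_angles), N_BINS, N_MICS)))

def beamform_frame(X, doa_deg):
    """X: (N_BINS, N_MICS) spectrum of one frame; pick the nearest stored beam."""
    idx = int(np.argmin(np.abs(beam_angles - doa_deg)))   # e.g. 42.3 -> 40 degrees
    return np.sum(weight_table[idx] * X, axis=1)          # weighted sum, no adaptation

# Usage: the localizer reports 42.3 degrees for this frame
frame = rng.standard_normal((N_BINS, N_MICS)) + 1j * rng.standard_normal((N_BINS, N_MICS))
out_spectrum = beamform_frame(frame, 42.3)
```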
For adaptive beamforming I will actually turn a little bit to the slide. After a lot of [inaudible]
we came up with this formula for the weights, the mixing coefficients. The adaptive beamformer
is absolutely the same; the only difference is that those noise models are updated on the fly. At
every single frame we recompute the best mixing matrix, and this is why, when we have
somebody talking whom we want to capture and another person talking whom we don't want to
capture, after less than a second our directivity pattern suddenly changes. You see here we have
the desired sound source -- it's all red in the directivity pattern -- and we have somebody who is
talking whom we don't want.
And the computer actually says, okay, there is some sound source we want to suppress, and
applies it. This is a typical application of convergent adaptive beamforming, which places a null
towards the undesired direction.
So there's a null towards that direction and a beam towards the desired direction. We have four
microphones, and that means we can satisfy four conditions. The first is that towards the
listening direction we want unit gain and zero phase shift. So I have three other conditions to
play with. If we have a second source, we can have another null. If we have a lot of sound
sources, we'll need a lot of microphones. This explains why we need 512 microphones for that
basketball court.
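Here is the counting argument written out for a single frequency bin: four microphones give four complex weights, so we can impose unit gain toward the look direction and nulls toward up to three interferers by solving a small linear system. The linear geometry, frequency, and angles are illustration values only, not the product's.

```python
import numpy as np

def steering_vector(angle_deg, mic_x, freq, c=343.0):
    tau = mic_x * np.sin(np.radians(angle_deg)) / c
    return np.exp(-2j * np.pi * freq * tau)          # plane-wave phases per mic

mic_x = np.array([-0.113, -0.036, 0.036, 0.113])      # assumed 4-element array (m)
freq = 1000.0
look, interferers = 0.0, [35.0, -50.0, 70.0]

# One row per condition: gain 1 toward the look direction, 0 toward each interferer
D = np.stack([steering_vector(a, mic_x, freq) for a in [look] + interferers])
g = np.array([1.0, 0.0, 0.0, 0.0])
w = np.linalg.solve(D, g)                             # the four weights for this bin

print(np.round(np.abs(D @ w), 6))                     # ~[1, 0, 0, 0]
```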
Okay. And one more thing -- I'll do a quick demo. We can go a little bit further: we can do
localization of the sound for each frequency bin and then do something additional, beyond just
the linear beamformer, which was simply summing the signals from the microphones weighted
in a certain way. What we can do is apply an extra weight per bin. This is kind of a spatial
suppression. And here is a simple illustration.
[audio playing]
>> Ivan Tashev: We have a human speaker here and an array tucked in here. This is a radio.
And of course a lot of noise.
So just for the sake of the experiment, what I did is: everything within plus/minus 20 degrees
goes through; everything beyond that gets a gain of zero. We don't need those signals. And this
is what we have at the output.
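The plus/minus 20 degree experiment amounts to a hard per-bin gate like the sketch below; the per-bin direction estimates are placeholders for whatever per-bin localizer actually feeds this step.

```python
import numpy as np

def spatial_gate(Y, per_bin_doa_deg, target_deg, width_deg=20.0):
    """Y: beamformer output spectrum (n_bins,); zero bins arriving outside the sector."""
    keep = np.abs(per_bin_doa_deg - target_deg) <= width_deg
    return Y * keep                         # gain 1 inside target +/- width, 0 outside

# Usage with made-up numbers: 257 bins, target talker at 0 degrees
rng = np.random.default_rng(0)
Y = rng.standard_normal(257) + 1j * rng.standard_normal(257)
per_bin_doa = rng.uniform(-90, 90, 257)     # placeholder per-bin direction estimates
Y_gated = spatial_gate(Y, per_bin_doa, target_deg=0.0)
```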
[audio playing]
>> Ivan Tashev: So the radio is gone completely. We have two sound signals in the same
frequency band, and the only difference is that they came from different directions. Using the
microphone array, we can separate them and suppress the one we don't want. Her voice sounded
slightly distorted, and that's because, I want to repeat, the filter was just this crude. We have
way more sophisticated statistical methods to introduce as little distortion as possible and still
suppress whatever we don't want.
So this is pretty much what we just heard, as a function of time, and this is direction. This is
the radio here, and this is her voice. You see that we have enough resolution to let this one go
through and to suppress this one. In terms of directivity index, the blue line is just the
beamformer, and you can see that the spatial filtering actually gave us around 15 to 20 dB better
directivity index.
It's time for a 15-minute break. Walk around. And, by the way, I will switch to the Kinect
device, so we can use the break to play some games -- I brought an actual Xbox with real games.
In the second part we will continue to see how this thing works. Enjoy the device in the
meantime, do some socializing, and after 15 minutes we'll get together again. There will be a
raffle to see who is going to get the door prizes here. In short, a 15-minute break, people.
>> Bob Moses: Okay. If we can get people to come back and sit down, we're going to do our
raffle now. Must be present to win. We have four Microsoft pens. That will be the first four
winning prizes.
Oh, Brian Willbury. Willoughby. Sorry.
[applause]
>> Bob Moses: Okay. Where's Rick? Do you want to write these down, Rick, who wins what?
Greg Mazur [phonetic], come on down. Also known as the cookie lady. Greg is responsible for
your treats. Thank you, Greg.
Rick [inaudible]. Travis Abel. Congratulations.
Okay. Next we have a one-gigabyte flash drive thing. Scott Mirins.
Next we have a copy of Windows 7 Home Edition. Ken Kalin. Did I pronounce -- sorry for all
the names I'm butchering, by the way.
Windows 7 Ultimate. Gary Beebe.
And the grand prize, this is a copy of Ivan Tashev's Sound Capture and Processing book, and if
you're nice he might even sign it for you. Steve Wilkins.
>> Ivan Tashev: Okay. We'll start to move quickly. So far we have been dealing with sounds we
don't know anything about, those other noises. We can't do anything with them except keep
them outside of our sound capture.
There is also a group of sounds which we do know: what we send to the loudspeakers. When
you watch a movie, there is quite a loud sound coming from the loudspeakers; when you talk
with somebody, the voice from the other side comes from the loudspeakers. It would be nice to
be able to remove those sounds, because we want to capture only the local sound. This process
is called acoustic echo cancellation.
This is one of the earliest and most frequently used signal processing algorithms -- it is part of
every single speakerphone and every single mobile phone. The scenario is very simple. We have
two rooms, each with a microphone and a loudspeaker. This one is called the near end room, our
room; the other is called the far end room.
If you don't do anything about this, what happens is: this guy speaks, it is captured, transmitted,
played by this loudspeaker, captured by this microphone, played back in the near end room, and
captured by this microphone again. Every word ends up with a kind of echo. In the worst case
we can even have feedback, which frequently happens when you place a microphone close to the
speakers.
So if we can somehow remove from here the sound we sent to the loudspeaker, that will be
nice, because on the output of this acoustic echo cancellation we'll have just the local sound, the
near end sound. In most systems this happens in two stages. The first is known as the acoustic
echo canceler. That's one of the first applications of the so-called adaptive filters.
Roughly what happens is that we send this signal to the loudspeaker, and it reaches the
microphone convolved with a function called the room impulse response -- not the same for all
frequencies, and for some frequencies barely at all -- but in general, after this filtering, the signal
ends up here. And of course there is the local speech and the local noise. If we can somehow
estimate this filter, we can filter the signal we sent to the loudspeakers and simply subtract it.
Theoretically we will end up with just the local speech and the noise, because if our estimate
here is correct, we'll be able to subtract the entire loudspeaker portion of the signal.
This does not happen exactly, of course. This filter is adaptive: we carefully watch the output
and tweak the coefficients of the filter on the fly so we can minimize the output here. What is
left from the loudspeaker signal is the so-called residual. A typical well-designed acoustic echo
canceler removes 15 to 20 dB of the loudspeaker signal. So it's 20 dB down, but still audible.
What more can we do?
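A classic normalized-LMS canceler is a reasonable sketch of the adaptive filter just described: estimate the loudspeaker-to-microphone path and subtract its predicted echo. Filter length, step size, and the simulated room below are illustration values, not the product's.

```python
import numpy as np

def nlms_aec(far_end, mic, n_taps=256, mu=0.5, eps=1e-6):
    h = np.zeros(n_taps)                    # adaptive estimate of the echo path
    buf = np.zeros(n_taps)                  # most recent far-end samples
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        e = mic[n] - h @ buf                # error = local sound + residual echo
        h += mu * e * buf / (buf @ buf + eps)
        out[n] = e
    return out

# Simulated usage: mic = far-end through a short decaying "room", plus local noise
rng = np.random.default_rng(1)
far_end = rng.standard_normal(16000)
room = 0.1 * rng.standard_normal(64) * np.exp(-np.arange(64) / 8.0)
mic = np.convolve(far_end, room)[:16000] + 0.05 * rng.standard_normal(16000)
cleaned = nlms_aec(far_end, mic)            # echo largely removed after convergence
```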
The next reasonable step: if you guys remember, we had the linear beamformer and then that
suppressor thing where we applied gains between 0 and 1. The same idea follows here. We
already did what we could while treating the signal as a complex number with phases and
magnitudes. What's left is to estimate the energy and suppress it directly: estimate a gain, again
between 0 and 1, for each frequency bin, and try to suppress what's left of this energy in the
same way as the noise suppressor did, except that instead of having a constant noise model we
estimate on the fly how much of this energy is still left here.
This is called the acoustic echo suppressor. The combination of an acoustic echo canceler and
an acoustic echo suppressor is usually named an acoustic echo reduction system.
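The suppressor stage reduces to a real gain between 0 and 1 per frequency bin, something like the sketch below; the residual-energy estimate is a placeholder for the on-the-fly estimator being described.

```python
import numpy as np

def echo_suppress(spec_after_aec, residual_energy_est, floor=0.1):
    """spec_after_aec: complex spectrum out of the canceler;
    residual_energy_est: per-bin estimate of leftover loudspeaker energy."""
    energy = np.abs(spec_after_aec) ** 2
    gain = np.clip(1.0 - residual_energy_est / (energy + 1e-12), floor, 1.0)
    return gain * spec_after_aec            # attenuate only, never boost

# Usage with made-up spectra: pretend 30 percent of each bin's energy is residual echo
rng = np.random.default_rng(2)
spec = rng.standard_normal(257) + 1j * rng.standard_normal(257)
out = echo_suppress(spec, 0.3 * np.abs(spec) ** 2)
```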
So far we had one loudspeaker and one microphone. That's a perfect speakerphone, and this is
pretty much what happens in every single speakerphone. In addition, we have one more
[inaudible] here, which is the stationary noise suppressor, the first signal processing algorithm
we discussed. And, voila, you have a pretty nice and decent speakerphone.
Okay. That's good. So now we can do speakerphones, and the question is: can we do stereo?
The first idea that comes up is, okay, we'll chain two acoustic echo cancelers, one will get the
left channel, the other will get the right channel, we'll subtract whatever they estimate, and we'll
have our echo suppressed.
This does not work, and the reason is that those two signals are highly correlated. On the left
and right loudspeakers, if I want to play a sound source right in the middle, I send the same
signal to both. If the sound source is more to the left, it is mostly on the left loudspeaker
channel, but an attenuated and delayed version still comes out of the right channel.
And this means that the acoustic echo cancelers, if you do it this way, will be chasing those
phantom sound sources, not the actual signals from the speakers. Because of this correlation,
we -- okay, let me put it this way.
Those adaptive filters try to solve a mathematical problem where we have two filters to estimate
and just one equation. That means we have an infinite number of solutions, and the chance that
our two adaptive filters converge to the right one -- the one that stays valid when the proportion
between the left and the right channel changes -- is very small. What happens in reality is that
the adaptive filters chase those phantom sound sources, constantly trying to find a better
solution, and as a result we get very bad echo suppression. It goes down only 3, 4, 5 dB,
somewhere there.
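One compact way to write down the problem being described (the notation is introduced here, not taken from the talk):

```latex
% Echo at the single microphone from two loudspeaker channels:
\[
  y(n) = (h_L \ast x_L)(n) + (h_R \ast x_R)(n) + s(n).
\]
% If the channels are strongly correlated, e.g. x_R = g \ast x_L, then any pair of
% estimates with \hat{h}_L + \hat{h}_R \ast g = h_L + h_R \ast g cancels the echo
% equally well: one equation, two unknown filters, infinitely many solutions -- and
% the particular one the filters converge to breaks as soon as g changes.
```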
So the potential approaches, written about all over the specialized papers, ask: can we
de-correlate the two loudspeaker channels? One of the guys actually suggested, really suggested
in a scientific paper, what if we introduce 10 percent nonlinear distortion on the left channel.
This is sufficient to de-correlate the channels, and, voila, those two adaptive filters converge
properly.
10 percent THD is actually painful to listen to. So in this direction, people continued looking:
okay, what if we use psychoacoustics and introduce inaudible distortions, distortions which
humans cannot hear, and still keep those two channels de-correlated enough for the two filters to
converge. There are potential solutions in that area. There are other papers which say, okay,
since those filters cannot converge, let's rely on just the acoustic echo suppression -- but that is
costly because it introduces nonlinear distortions.
What we do in devices such as Kinect is try to live with [inaudible]. We have a microphone, we
have one adaptive filter, and that's it -- we cannot have more than one. Then the question is: can
we somehow learn the way the signals are mixed at the microphone, mix the signals we send to
the speakers in the same way, and use just a single adaptive filter to compensate for people
moving around and things changing? Because my loudspeakers are there and the microphone is
here, but when I move, the sound reflects and bounces off me, and this slightly changes the
filters I have to use. Hopefully this single filter can compensate for that.
Initially what we did was use a chirp signal -- pretty much a linear frequency sweep from 200 to
7,200 hertz -- and just estimate those filters during the first calibration phase. That was working
perfectly, up to the moment my colleagues on the Xbox team said: this sounds ugly, this is not
for a consumer product. So this is why today, when you set up your Xbox, you'll hear this sound
twice.
[audio playing]
>> Ivan Tashev: So this music was carefully selected to have every single frequency between
200 and 7,200 hertz present. It sounds way better than the ugly linear chirp, and it plays during
the installation of your Xbox and Kinect. We estimate and store this mixing matrix, and then we
use one single adaptive filter in real time when you watch a movie and there is sound and
shooting and you want this Xbox [inaudible]. This is what happens: on each of the four
microphones we get rid of most of the echo here.
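As a toy sketch of that arrangement: combine the loudspeaker channels with the mixing learned during calibration, then run one adaptive filter (the nlms_aec sketched earlier) against that single reference. The scalar weights below are placeholders; the talk describes a stored mixing matrix, not two scalars, so treat this purely as the shape of the idea.

```python
import numpy as np

def stereo_aec_single_filter(x_left, x_right, mic, mix_weights):
    """mix_weights: per-channel mixing learned during calibration (placeholder scalars)."""
    reference = mix_weights[0] * x_left + mix_weights[1] * x_right
    return nlms_aec(reference, mic)         # one adaptive filter tracks the remainder

# Usage with a made-up calibration result
mix_weights = np.array([0.6, 0.4])          # stands in for the learned mixing
```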
So this is one of the things we just brought you. I want to underline how proud I am of this
algorithm. The inventor of acoustic echo cancellation himself, from Bell Labs, wrote a paper in
1991 stating that stereo acoustic echo cancellation is not possible. In 2007 we demonstrated a
working solution during our Microsoft Research TechFest, and in 2010 it's part of a [inaudible]
product.
So it's a little bit more complex than it looks on those simple diagrams, but I will use this just to
make a simple demo here. So this is what one of the microphones capture.
[audio playing]
>> Ivan Tashev: And this is what we have on the output.
[audio playing]
>> Ivan Tashev: So as you can hear, we can get rid of most of the sound from the loudspeakers.
I want to underline again how challenging this is.
When you use this algorithm in your conference room, you usually set the loudspeakers to the
level of a human voice -- we speak at around 60 dB SPL, and that is about where you will set the
loudspeaker level.
When you sit in your gaming room and start to play a game or watch a movie, the sound level is
usually around 75, 80, 85 dB SPL, and we measured crazy people who listen at 90 dB SPL. 90
dB SPL is 30 dB above your voice -- roughly a thousand times more energy coming through the
speakers than from your voice -- and you have to get rid of all of that and still keep your voice at
least 20, 25 dB above the noise floor that's left, so you can get distant communication and
speech recognition.
Okay. What can we do next? Again, this talk is not just about Kinect; we are talking about a lot
of algorithms related to sound capture. One of the classic problems floating around is so-called
sound source separation.
We have an audio channel with two voices mixed. Can we separate them? Yes, we can do this
even with a single channel. A microphone array with more than one channel gives us additional
cues, because at least those two voices come from different directions.
And we have pretty much two separate groups of algorithms. One of them is so-called blind
source separation. Mathematically speaking, the question is: we have this signal, can we split it
in two in a way that maximizes the statistical independence of the results -- presumably voice 1
and voice 2.
This just uses the fact that the two voices are statistically uncorrelated. The second group uses
the fact that those two people are usually in two different directions, so we can use the
microphone array to say: by the way, one of them is here and one of them is there.
Because the two approaches exploit different properties, their effects can in principle add up --
which rarely happens in real signal processing. Usually you have one algorithm which gives 10
dB of separation and another which gives 10 dB, you chain them, and instead of 20 you get 12.
So the good things do not always sum, but in this particular case, because we combine
orthogonal features, we can expect it to help. And it actually does. We combined those two
algorithms just to see how well we can separate the sources. This axis is distance in meters, and
this is the angle between the sound sources.
Roughly, think of a four-seat couch, 1.5 meters wide: this is person 1, two persons in between,
and this is the leftmost and the rightmost person. Move the couch further away and the distance
between them stays the same, but because it is further, the angle shrinks. So the same couch at
four meters gives just 25, 26 degrees between the two outer persons on the couch.
And this is pretty much how many dB of separation you get between those two sound sources.
We can suppress the other guy by 20 dB, which is already quite good, and that holds up to about
2.5, 2.7 meters.
>>: Can you define SIR and PESQ, please.
>> Ivan Tashev: SIR is signal-to-interference ratio. If the two sources have equal loudness, it
tells you how much you can suppress one relative to the other.
PESQ is something most audio people are not familiar with. It stands for Perceptual Evaluation
of Speech Quality. This is a standard which comes from telecommunications. There is an
ITU-T -- International Telecommunication Union -- Recommendation P.800, which describes
MOS testing: how we can evaluate how good the sound quality is, how humans perceive it.
And it's as simple as this. You get 100 people, ask them to listen to a set of recordings and rate
them from 1 to 5, and then you average. That is the mean opinion score: a number between 1
and 5, where 1 is very bad quality, I cannot understand it, and 5 is crystal-clear quality.
This is time-consuming and not something you are willing to do constantly. PESQ is a signal
processing algorithm which serves as a proxy for this type of measurement. You take the clean
speech signal and the degraded signal, run them through this algorithm -- which is an ITU-T
standard -- and you get a number from 1 to 5 which is very highly correlated with what you get
from real people.
And it runs in about a tenth of the duration of your recording. So increasing PESQ means you
get better perceptual quality. A 0.1 improvement is audible if you are in the business -- not
necessarily an audio expert, but someone who can listen well. With a 0.2 improvement in PESQ,
my grandmother should be able to tell the difference. And here we are talking about 0.4, 0.5, so
the improvement in perceptual quality is well audible.
Again, heavy processing improves the quality of the signal you want, but it also introduces
distortions which reduce the quality. It's always a question of balance. This is why, when we
design signal processing algorithms, especially for [inaudible] devices, we carefully watch this
number. It is the feedback telling us whether we overdid it. If you overdo it, yes, the nasty signal
you want to suppress is gone, but the one you want to keep already sounds very bad.
So PESQ is actually one of the measures of how well we do. Usually you tighten the bolts of the
algorithm and it goes up, up, up, and at some point you overdo it and PESQ starts to go down.
That's the moment you have to stop and back off a little bit.
>>: So is there any intelligibility measure as a standard?
>> Ivan Tashev: There are intelligibility measurements. They are mostly done with humans, and
pretty much --
>>: There's nothing --
>> Ivan Tashev: As far as intelligibility is concerned, the only thing we can do is decrease it.
Pretty much you cannot beat the human brain.
>>: No, no, no, I mean is there an algorithm similar to PESQ?
>> Ivan Tashev: Oh, automated. Nope.
So let's skip this and listen to a demo. We have two persons shoulder to shoulder, 2.4 meters
away, speaking -- one of them in Chinese, the other in Korean.
[audio playing]
>> Ivan Tashev: And then we do our magic using the power of the microphone array and
independent component analysis.
[audio playing]
>> Ivan Tashev: So this is not picking one voice out of 500, but still, at 2.4 meters, with all the
reverberation and noise in that room, we are able to separate them. The quality might not be
great, but it is sufficient for speech recognition and sufficient for telecommunication.
Technically this means I can sit on the couch with my wife and we can watch two channels
simultaneously -- most probably she's on the large screen watching the soaps and I'm on the
small screen with the latest football game -- and we can each speak and send voice commands,
and they'll be executed separately for each of the screens.
So this is one of the potential applications of this algorithm.
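For the "statistical independence" half of this, here is a rough sketch using FastICA from scikit-learn on a toy instantaneous two-source mixture. Real speech in a reverberant room needs convolutive or frequency-domain variants plus the array's spatial cues; this only illustrates the core idea, with made-up stand-in sources.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 16000
s1 = np.sign(np.sin(2 * np.pi * 5 * np.arange(n) / n))    # stand-in "voice" 1
s2 = rng.laplace(size=n)                                   # stand-in "voice" 2
S = np.stack([s1, s2], axis=1)

A = np.array([[1.0, 0.6],                                  # unknown mixing: two talkers
              [0.4, 1.0]])                                 # observed at two microphones
X = S @ A.T

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)               # recovered sources, up to scale and order
```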
>>: Ivan, how many dB of separation do you get?
>> Ivan Tashev: What we just listened to was around 21, 22 dB. If we turn back to the chart,
we're talking 2.4 meters, 26 degrees -- yes, around 21, 22 dB. If you go closer, you're at 28.
And here is another interesting thing: if I add this line here, the blue is what had already been
published in this area, and people were claiming, wow, we did 1 dB. The first time we tried to
publish that we had achieved 20 dB in the same conditions, our paper was rejected -- this was
"not possible."
So, again, we do a lot of complicated signal processing. Why? What should we do with it?
What are the applications?
Here is a typical application for cars. You guys have heard about Ford SYNC, Kia UVO, and the
Nissan LEAF. All three cars share the same property: their in-car entertainment system runs on
the Microsoft Auto Platform, which, on top of everything else, contains an audio stack for
telecommunication, and they do speech recognition.
Telecommunication, okay -- this converts the car audio system into a gigantic [inaudible]
headset. For the driver, who listens through the nice and cool audio system with the good
speakers, that's a pleasure. Unfortunately, the noise in the car is not low, and you have to do
acoustic echo cancellation and noise suppression and then encoding and decoding to get onto
the GSM or CDMA telephone line.
On the other hand, people are not happy, and I can play you what they hear. So this is what we
have in the car.
[audio playing]
>> Ivan Tashev: And, on the other hand, what we have is...
[audio playing]
>> Ivan Tashev: So I selected a segment where the noise slowly increases -- this is the car
accelerating on the highway. You notice those breaks in the audio. This is not frame loss
[inaudible] during the transfer of the audio; this is just the encoder refusing to work correctly at
such a bad signal-to-noise ratio. The standard GSM encoder is not adapted or designed to work
at so low a signal-to-noise ratio.
The second thing we notice is that all those processing blocks are optimal in different senses.
Some of them are minimum mean square error, some log minimum mean square error; PESQ is
for codecs. So the idea is: okay, what if we go and optimize end to end? We have a substantial
amount of recordings, and we process them through all of those blocks, including the encoder
and decoder, simulating the entire telephone line.
And at this point we measure PESQ. Then the question is: can we tweak the nuts and bolts of
the algorithms, tune them to maximize PESQ at the other end of the telephone line -- to
maximize the user experience end to end?
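In pseudocode terms that end-to-end tuning is just a search over the knobs of the chain, scored after the simulated telephone line. Everything named here -- process_chain, codec, quality_score, and the two example knobs -- is a hypothetical stand-in for the real AEC/beamformer/suppressor, the GSM codec, and PESQ.

```python
import itertools

def tune_end_to_end(recordings, process_chain, codec, quality_score):
    """recordings: list of (clean_reference, in_car_capture) pairs."""
    best_params, best_score = None, float("-inf")
    suppression_db = [6, 10, 14, 18]        # example knob: noise suppression depth
    smoothing = [0.5, 0.7, 0.9]             # example knob: gain smoothing factor
    for params in itertools.product(suppression_db, smoothing):
        scores = []
        for clean, noisy in recordings:
            processed = process_chain(noisy, *params)
            received = codec(processed)     # simulate encode/decode over the line
            scores.append(quality_score(clean, received))
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score
```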
And of course, because the noise is too high, what we do is add a microphone array -- two,
maybe four elements -- do the acoustic echo cancellation first on each of the channels, then our
beamformer and the noise suppressor, and then optimize this entire chain. That brings us to
something like this. If I can find my cursor. Okay.
[audio playing]
>> Ivan Tashev: So the first thing you notice is that you can still hear the sound of the car, and
that's completely fine. For us, the voice of the speaker has the maximum perceived quality.
>>: But at that level it should be easy to suppress [inaudible], just like in your earlier case, two
speakers --
>> Ivan Tashev: Every suppression has a price. You squeeze out 14 dB of noise suppression and
that's already on the border. You go to 20, and you start to have badly --
>>: In the earlier case the distortion was much less.
>> Ivan Tashev: This is the same audio signal. Those 6, 7, 8, 12 dB we gained here actually
helped the encoder to work properly, and the optimization process pretty much stopped at the
moment the encoder started to work properly, without overdoing it. This is the key idea: you
optimize end to end and get the maximum in terms of perceived sound quality.
So what can we do, and why do we have to do this in the car? Because when you drive, your
hands are busy and your eyes are busy. What's left is ears and mouth, so speech is a good
medium for communicating with the computer system of the car. And we are not talking about
the computer that does the ignition or controls the gearbox; we're talking about the computer
that handles your music, your telephone -- the stuff related to the information and entertainment
of the user.
So speech is a good modality, a good component of the multimodal human-computer interface,
which also contains buttons and a graphic screen that is still okay to glance at. In general, when
you design such systems, the question you always have to answer is: do I want the drivers I
share the road with to be doing this in their cars?
So without going much further, I'll show you a short movie segment of what we have designed
here in MSR.
>> Video playing: [inaudible] principal architect in Speech Technology Group in Microsoft
research. We're in a driving simulator lab. And I'm going to demonstrate CommuteUX.
CommuteUX is a next-generation in-car dialogue system which allows you to say anything
anytime. This is a dialogue system which employs natural language input. And the system still
will be able to understand you.
Let's see how it works. At the beginning of our trip, let's start with listening to some music. Play
track Life of the Party. So this was the classic way to state the speech query. But we don't have
to specify track or artist. Play Paul Simon. So as we can see, the system didn't need the
clarification about the artist. Play the Submarine by the Beatles. And it can even understand
incorrectly specified names.
Let's make some phone calls. My Bluetooth telephone is paired to the system, and the computer
knows my address book. So I can ask directly by name. Call Jason Whitehorn. "Calling Jason
A. Whitehorn; say yes or no." Yes. "Jason A. Whitehorn; do you want cell, work, or home?"
Cell. "Calling Jason A. Whitehorn at cell."
So this was a classic and slightly painful way to interact with the Bluetooth phone. But our
system can understand and doesn't need disambiguation if the name is unique. For example, call
Constantine at home. "Calling Constantine Evangelo Lignos at home."
But there is even more. Frequently the driver needs to respond to some urgent text messages,
and it's definitely not safe while you're driving. So we can use our speech input to do the same.
"Message from Juan: ETA? Say reply, delete, call back, or skip." Reply: In 10 minutes. "Am I
right that you want in 10 minutes, or a number for the list?" Yes. "Got it. Message sent."
Maybe you notice it, but there is some irregularity in movement of our car, and I'm afraid that we
have a flat tire. And because this is a rented car, I even don't know how to open the trunk. So
now is the time to open the owners manual and to try to figure this out. But instead of this, our
computer already did this for us. So we can just ask. How to open the trunk. And we see the
proper page here. But then we need to replace the flat tire. How to replace a flat tire? And we
go directly to the page of the owners manual which describes how to replace the flat tire.
Once we have our flat tire fixed, we can continue our trip. And listen to some music. Play Nora
Jones.
>> Ivan Tashev: So two things here. You know, in a movie you can do everything, but this was
done on the fly, pretty much in one shot, and we didn't have much training for it. I had two or
three persons from the driving simulator project with some microphones behind me, and I made
a mistake only once, when I almost crashed the car. Otherwise this is kind of a proof that you
can drive and operate the system with less distraction than usual.
In any case, responding to text messages in this way is way safer than trying to punch the keys.
Yes, it's forbidden in most of the states, and still most teenagers do it. So the question here is
not whether we want the drivers we share the road with to do this at all; it's more whether we
want them to type or to respond with voice, at least for the most urgent messages.
Anyway, this is a demonstration of one of the applications of capturing sounds -- not only for
telecommunication, but as one of the modalities for operating a computer.
And of course the next scenario: we sit in our media room trying to find the remote control, or
arguing over the remote control, and the question is, can we use our voice instead? One of my
favorite [inaudible]: the man is lying on the sofa, the remote control is at the other end of the
coffee table, and the guy lifts the coffee table so it slides into his hand [inaudible]. Now, with
devices like Kinect, you don't even have to do that.
So what do we have in Kinect? We have that multichannel acoustic echo cancellation, an
algorithm no other company has [inaudible] so far. We have pretty decent beamforming and
spatial filtering. And eventually, in some cases, we have this sound source separation algorithm.
What Kinect has in addition to this is a 3D video camera. Okay, there are actually two cameras:
one of them is a visible-light camera and the other is an infrared depth camera. And there is this
four-element microphone array.
The depth camera is another new, pretty much revolutionary thing in the area of
human-computer interaction. It works in complete darkness and allows us to do -- okay, not
very easily, but allows us to do gesture recognition. Once you stand in front of the camera, after
some processing, your body skeleton is assembled and you have the XYZ of each of the major
joints. Initially it was just the head and the two hands, but later we could do the knees and
ankles -- even if a woman wears a long skirt, we are able to find the knees and the ankles -- so
we can track the body.
Then gestures: a gesture like this means if the Z of this joint is bigger than the Z of that joint,
boom, we have a hand raise. A gesture like this: if the Euclidean distance between this joint and
that joint is smaller than a given threshold, we have a clap, et cetera, et cetera. Most of the
gestures we use are quite simple to program on top of this sophisticated skeleton tracking layer,
which is part of the new XDK platform.
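Written out, those two rules are only a few lines. The joint names, the coordinate convention, and the clap threshold below are assumptions for illustration; the skeleton tracker supplies the actual joint positions.

```python
import math

def hand_raised(skeleton, up_axis=1):
    """skeleton: dict joint name -> (x, y, z); here axis 1 (y) is taken as 'up'."""
    return skeleton["right_hand"][up_axis] > skeleton["head"][up_axis]

def is_clap(skeleton, threshold_m=0.08):
    l, r = skeleton["left_hand"], skeleton["right_hand"]
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(l, r)))   # Euclidean distance
    return dist < threshold_m

# Usage with made-up joint positions (meters, camera coordinates)
skeleton = {"head": (0.0, 1.6, 2.0),
            "left_hand": (-0.05, 1.0, 1.9),
            "right_hand": (0.02, 1.02, 1.9)}
print(hand_raised(skeleton), is_clap(skeleton))   # False True
```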
And this allows game developers to create very fancy and interesting games. We will see some
of them in the demo. But before we go there, a couple of minutes of illustration of what we can
do with the combination -- no? Okay. Let me find this.
[audio playing]
>> Ivan Tashev: So, in short, having the depth camera combined with the microphone array
opens new ways to communicate with computers. I'll quickly demonstrate the gesture
recognition in the Kinect dashboard, so we can see how we can select things, start a game, et
cetera, and create pretty sophisticated yet easy and intuitive ways for humans to communicate
with computers.
Yes, we can create very fun games. I'm not going to dance and I'm not going to play Kinect
Adventures tonight. But I think the real value of this combination of sound, gesture, and speech
recognition in games is that we can communicate, talk, and interact with computers in a way
more natural way than we are used to with keyboard and mouse.
So the last thing, if you guys haven't seen it: this is the book I published last year. If you want
to go a little deeper and play with the algorithms we discussed, there are some MATLAB
[inaudible] there and some .wav files, so you can go and try something yourself if you're
interested in this.
Before we go to questions, I want to show you some things here. You can see my image in the
depth camera. Now I have gesture control and can go -- I think the game will start, but we'll
try -- okay. One of the ways to interact is basically using gestures. The game is Dance Central,
and technically you just have to dance together with Lady Gaga and others. But the only thing I
want to show here is the way we can select and use the menus.
[audio playing]
>> Ivan Tashev: Okay. I can see the [inaudible]. Okay. And then dance select. Who's playing.
And, for example, look, I have my left hand here for the left [inaudible] moves. I can have my
right hand. No mouse, no controller, nothing. And this is where the dancing begins. So pretty
much very natural, no [inaudible].
[audio playing]
>> Ivan Tashev: Who's going to dance? Come on up. Come on. Come here. It's easy. Trust
me. Come on.
[applause]
[audio playing]
>> Ivan Tashev: Okay. Watch one of your hands. Okay. Yes. Who is one of us. Screen
control. Okay. That's -- so try like this. Resume. Just repeat it. If you see your head, you're not
doing it. Very nice.
[audio playing]
>> Ivan Tashev: Nice.
>>: [inaudible] virtual mixer one?
>> Ivan Tashev: Which one?
>>: You know, the virtual -- you'll have a virtual mixer so you can mix, you know, a 64-track
mix.
>> Ivan Tashev: It's possible.
>>: You just don't want to slip and fall down [inaudible].
>> Ivan Tashev: And just one more quick demo. For this I'll need to mute the microphones.
Okay. So you see I'm almost ready. Xbox. Kinect. I don't have to touch the controller, I don't
have to make gestures.
>>: [inaudible] microphone here.
>> Ivan Tashev: This microphone is hopefully off. What we use are the Xbox microphones over
there. So you can say anything; you just precede the command with the keyword Xbox. This is
another thing that was a difficult research problem 18 months ago: an open microphone and no
push-to-talk button. You guys remember that in the car I pushed the button -- to say whatever I
want to say, I have to tell the computer I'm talking to it. Now the microphones listen to my
voice. There is no reaction, but then I can say: Xbox. Kinect ID. Is the microphone active?
Xbox. Kinect ID.
So pretty much everything you see on the screen in black and white is a command I can say, et
cetera, et cetera. It starts the identification software, where I can go -- this is a different way to
select: you pause and hold. And I want this other profile, so I go and sign in for this guy. And
this is just a quick demonstration -- I know many people like this -- of how you move and
control the avatar. That's the last thing I'm going to show you tonight.
[audio playing]
>> Ivan Tashev: So as you can see -- oh, you want me here. Okay.
[audio playing]
>> Ivan Tashev: Okay. This ends the whole story. That was a demonstration of a strange and
still not widely known way to communicate with the computer. I'm open for your questions. I
promise to stay, let's say, 20 or 30 minutes after the talk to start some of the games. You guys
can play Kinect Adventures, jump, et cetera, et cetera.
Me, I'm pretty much not into gaming. But this device breaks two barriers. The first is the
gender barrier. The image of the gamer sitting and holding the controller -- usually male, a
teenager or in his early 20s, shooting the enemies -- is gone. Dancing is going to open gaming to
the other gender. And the second is that, for God's sake, at some point we have to get up from
the couch, stand in front of the device, and start to move. And this is for every single person,
regardless of age. So it breaks the age barrier as well.
That was it. Thank you. You were a very, very, very good audience. Questions, please. Here.
>>: So what are other [inaudible] of operating besides command and control? Are there
anything --
>> Ivan Tashev: So there are games -- think about Kinect, and Xbox in general, as a kind of
operating system. It exposes certain interfaces. The game companies can use whatever they
decide to: they have access to the speech interfaces, they have access to the skeleton tracking,
and then it's up to their imagination.
It's tricky. Speech recognition is not yet at the level of the graphics and the control handling.
So of the games available for Kinect today, only a couple use voice and speech. One of them is
Kinectimals, where you can name your animal and send some commands, and the animal listens
to you. The other is actually the biggest surprise on the market: an extremely good speech
recognizer, a very rich grammar, and very good communication with the computer -- how they
did this we don't know, because it's not a Microsoft game. But that's pretty much it. We are
going to improve the underlying technologies, but what the gaming companies that create the
games do with the speech, gestures, et cetera, et cetera, is up to them.
Go ahead.
>>: [inaudible] is a customer able to upgrade their firmware when new versions come out --
>> Ivan Tashev: Yes.
>>: -- in Kinect itself?
>> Ivan Tashev: Yes. The firmware is -- pretty much the device has only a boot loader. From
the moment you connect it to your Xbox, it downloads the latest and greatest from the Internet
and pushes it to the device.
>>: Using -- if you have the camera, or the microphone array, to know the location of a person,
and I turn my head, how accurately can you tell the angle I'm speaking at relative to the
microphone with that system? What I was imagining was Dune -- the science fiction book --
where you yell the direction you want to send your blast. In order to make an attack, could I
have an opponent and yell in that direction and shoot something?
>> Ivan Tashev: The gestures can do this. We do not have a technology in the XDK which
allows you to guess the head orientation. I'm not saying it's not doable, and I'm not saying we
don't have the technology. It's just not available in the existing --
>>: [inaudible] with the microphones, though.
>> Ivan Tashev: With the microphones, what we can do is estimate the direction, meaning the
angle in this plane. So it can tell you where you are, but it cannot tell you what the orientation
of your head is. You can partially do this with the depth camera, but that's not part of the
interfaces the XDK exposes.
Can the game companies do this? Yes, they have access to the raw data. That actually gives
them an advantage, because they have something which [inaudible]. It's doable; it's just not in
the existing XDK.
>>: I have two questions. The thing you did at the end there where you were stepping from
green square to green square and holding poses, was that some kind of calibration?
>> Ivan Tashev: So this is Kinect identification. What the program was doing was trying to
estimate my biometrics -- the sizes of my bones -- which later allows the program, when I just
stand in front of the device, to say: this is Ivan. This is not as difficult as it sounds, because in a
family you have three, four, five people, usually of different sizes.
So you can do person identification based on biometrics relatively reliably. It is not a serious
problem. It is not in the existing XDK. But recognizing by voice is another signal processing
algorithm we can use.
And, by the way, the combination of biometrics plus voice, which again are orthogonal, can
give you a very reliable identification -- pretty much beyond what the casual setting needs. You
play a game, it's a trivia game, you just sit down and the scores go to your account; for that,
maybe 97, 98 percent recognition is enough. And you can do this with voice plus biometrics, or
with biometrics without voice, so it's reliable enough. But eventually at some point we may
have [inaudible].
>>: And in the earlier example when you were driving in the car and telling the thing to play a
song, what if you said -- were having a conversation and you said I'm going to the play tonight.
Would it immediately grab on that word play and play the song Tonight?
>> Ivan Tashev: To a certain degree, yes. This is why I said this is natural language input.
What happens is that we go one degree beyond just recognizing the words: we try to understand
the meaning. Even with a perfect speech recognizer -- even today's speech recognizers are better
than humans here -- the major source of irregular or ambiguous queries is the human.
So we try to give a certain freedom. It's not as good as just having a conversation; in general we
tell our users: you have to say the command first -- play -- and then what you want to be played.
We even tell them to say play, and then artist, album, or track, but usually they get it wrong. We
hear commands like "play track B of Beatles" or something like this.
On the front end, we ask the users to stick with this structure. But on the back end we can
actually handle stuff like "play me -- what was that song about somebody's --" and usually this
will get you the Yellow Submarine. We don't advertise this, because it is basically our backup,
our [inaudible] to handle human errors.
So technically, in most cases it will handle it properly. Would it be good enough to listen to an
entire conversation and just [inaudible] extract some words? Maybe not.
>>: But in the car you have to push the button before you --
>> Ivan Tashev: Yes. It happens once, just to initiate the conversation. So presumably I'm
talking to the passenger; when I want to talk to the system, I push the push-to-talk button and
the conversation with the computer starts. We can exchange a couple of phrases, the computer
can ask for disambiguation, et cetera, but there is a conversation, the opening and closing of a
task. The task can be call somebody, play that song, et cetera, et cetera. During this you don't
have to push the button again.
But at least once you have to signal: I want to talk with you. This is not the case with the
Kinect. This is another brand-new thing, unknown so far and very difficult to achieve even in a
research environment, and now it's in a commercial product.
>>: In the car, once you've -- once the computer's completed your request, does it continue to
listen to you if you want to make another request? Or do you have to push the button again for a
new request?
>> Ivan Tashev: It's all at the task level. You have a certain task: you signal the beginning, you
talk, you finish the conversation, the task ends, and the computer goes back to sleep.
>>: Okay.
>>: Ivan, does it learn when -- in the car when you're making requests over time, does it learn
your terminology like a smartphone does?
>> Ivan Tashev: It's a question of how much your smartphone really does. I think one of the
next things we want to do in the Microsoft Auto Platform -- and actually it's valid for Xbox as
well -- is to recognize who is talking and adapt the speech recognition model towards that
particular person.
Let's say I get into the car; it identifies me by the key, because keys are different and have a
unique code, or by some biometrics, or by my voice. It recognizes me and loads my acoustical
models, which are adapted to my voice. Then the next step would be: oh, by the way, this guy
has a very funny way of expressing the tasks, but I can adapt, based on the feedback -- if he
phrases things a certain way frequently, I can start to learn it. But we're not there yet. The first
adaptation towards the person, on the acoustic level, is the first thing we can and should do
before we go to that level.
Rick?
>>: How much is the speech recognition affected by accent?
>> Ivan Tashev: Okay, it works for my accent. In the Speech Technology Group in Microsoft
Research, out of 10 or 11 people we have -- how many? -- one or two native speakers.
>>: Two.
>> Ivan Tashev: Two native speakers.
>>: Three, three, three.
>> Ivan Tashev: Three. So eight of us were born outside of the U.S., from pretty much every
continent, and it works for us. So we train the speech recognizers.
>>: How much of the processing -- is all of your audio processing done in the box? Or is some
of it done in the Xbox? Same with the video processing: is that done in the box, or in the --
>> Ivan Tashev: Both. Actually, in many cases we have several instances of the audio pipeline
running. When you do game chat -- meaning you're playing Halo, et cetera, et cetera, with
real-time communication with somebody on Xbox Live -- the games don't want the audio
processing to steal any CPU cycles, because they are busy rendering complex graphics.
For that we have a specialized [inaudible] CPU inside the device. It's not very powerful -- a
2005-era mobile-phone-style CPU [inaudible], around 200 megahertz -- but it can handle the
entire audio stack.
For speech recognition, this happens on the console. We bring the four channels down because
the audio stack for speech recognition is a little more sophisticated -- we do more processing
there to clean things up, because a speech recognizer is in general more sensitive to the
[inaudible] and the noises.
And when you do voice chat using Windows Live Messenger -- you can do this from your
living room, and I actually appreciate it very much because I sit in my media room and talk with
my opponents across the ocean -- then the same voice audio stack runs [inaudible] on the device
and [inaudible] on the console, because it's kind of pointless to send it to the device. So three
variants of the same code [inaudible] are executed either on the console or on the CPU in the
device.
The video processing, up to producing the depth image, happens inside the device. There are
three or four DSP processors which handle the video and the infrared, and what we send down
is a black-and-white image where every pixel is not brightness but distance.
All the skeleton tracking happens on the console. This is a very sophisticated, complicated
algorithm; it cannot run on the device. It could, but then the device would have to be two times
bigger, with one more fan to get rid of the heat inside.
So this is pretty much where we are.
>>: What's the maximum distance that the device can be from the console?
>> Ivan Tashev: It's a USB. You already can buy from Amazon a USB cable extender for
Kinect. So 20 feet? I don't know. Most probably. Maybe a little bit more.
>>: 50 feet?
>>: [inaudible]
>> Ivan Tashev: I haven't tried.
>>: These are full speed. That would help.
>> Ivan Tashev: It's a full-speed USB.
>>: 12 megahertz? I get them confused. 12 megahertz or the hundreds of megahertz?
>> Ivan Tashev: Hundreds of megahertz.
>>: Okay.
>> Ivan Tashev: It's USB 2.0, full speed. And any cable that can run that at a given distance
should be able to run Kinect. But I haven't tried. It comes with this cable, and this is what I use.
this cable, and this is what I use.
>>: And all power supplied by the USB?
>> Ivan Tashev: Nope.
>>: Oh, okay. I don't see it. All I see is one --
>> Ivan Tashev: So this is, you know, standard USB. It carries all the standard USB signals.
But instead of plus five volts at one amp, we have a way more powerful power supply. It
goes -- and then the rest is connected to --
>>: Oh, it breaks out of the --
>> Ivan Tashev: Yes.
>>: Okay. Gotcha.
>> Ivan Tashev: You cannot feed 3G, 3, 4G speeds plus a separate CPU plus the cameras plus
the infrared projector with one amp. Period.
>>: I'm curious of any plans of developing protocols [inaudible] talk to my own receiver or
third-party hardware.
>> Ivan Tashev: Microsoft is unlikely to release this. You can buy the Kinect device and search
the Internet: seven days after we shipped Kinect, you could already download hacker drivers for
Windows.
>>: Doesn't that sell devices? Don't you guys like that? I mean, people are going to buy it just
for that.
>> Ivan Tashev: This is cool and nice, and actually the margin on the device itself is not bad.
But if you use it with Xbox, we get royalties from the games as well. Otherwise, it's completely
okay -- you guys understand that Microsoft Research has very good relationships with most of
the professors in academia. Most of them are dying to get hold of the device, plug it into a
Windows machine, and start experimenting with human-computer interaction interfaces. But
from our business team's point of view: okay, a bunch of researchers want to do this and that's
pretty much it -- how many are they, how much will it cost us to ship the drivers -- naaa.
So we may even -- MSR may eventually release some drivers not as a product but as a free
download relatively soon. There are plans in discussion for this. But no promise. We don't
know what is going to happen here.
It would be nice to go and to play on your Windows machine. And most of the algorithms
actually are designed in Windows for both video and audio. We have this connected to our
Windows machines, but this is not something we can share for now.
>>: Eventually probably.
>> Ivan Tashev: Sooner or later.
>>: [inaudible].
>> Ivan Tashev: I thought the hackers would need at least two or three months to release those
drivers. They were already available on the seventh day after we shipped Kinect.
>>: [inaudible] amazing.
>> Ivan Tashev: Yes.
>>: [inaudible].
>> Ivan Tashev: Which guys?
>>: I don't know, the guys who came up with the drivers in seven days.
>> Ivan Tashev: They're not Microsoft employees. More questions or it's play time? Thank you
very much. Thank you.
[applause]