>> Bob Moses: Thanks for coming tonight. Welcome to the December meeting of the Audio
Engineering Society. Tonight we've got Ivan Tashev talking about the Kinect, which is kind of
exciting. I read last night online that in the first month it sold 2 1/2 million units, and the iPad
sold 1 million. So 2 1/2 more -- times more successful than the iPad already. That's probably
one of the most successful consumer products ever sold at that kind of a rate. So it's exciting to
have one of the architects of the device here and look at how it works. I know there's people out
there that want to learn how to hack it. If you want to know that, you have to somehow get Ivan
a little bit tipsy or something.
We have a -- kind of an idea of what we might do next month just in the way of the January
meeting. We're trying to get a guy named Dave Hill to swing through here on his way home
from NAMM. If you know who Dave Hill is, he's -- how would you put it? -- a boutique audio
designer. He has a company called Crane Song that makes real interesting signal processing
equipment. They just announced a relationship with Avid and they're doing stuff for Pro Tools
now. And he also designs stuff for some of the other very high-end boutique microphone
preamp companies.
Very interesting guy. I see him every time I go out to Musikmesse in Frankfurt in the middle of
February. He's always wearing leather shorts and Hawaiian shirts in the winter of Frankfurt,
Germany. He's a fun guy.
So we hope that's going to happen. That's not solid yet, but that's what we're trying to put
together. So stay tuned to our Web site, and we'll put something fun together for next month.
A couple of housekeeping things. We need to know how many AES members we have in the
room. And we don't have Gary here to count, so do you want to do that, Rick?
>> Rick: Ten.
>> Bob Moses: Ten. Thank you. And want to count the --
>> Rick: Yeah.
>> Bob Moses: -- the rest? For those of you who are not members, we'd love to have you join
our association. The AES, if you're not familiar with the organization, is a professional society
for audio people. All kinds of people, recording engineers, product designers, acoustics,
researchers, and so on. It's been around for over 60 years now. A lot of the really important
audio developments in the history of audio have been sort of nurtured and incubated within the
AES, you know, the development of stereophonic sound and the compact disk and MP3 and so
on. A lot of that initial research was presented and discussed within the AES.
And there's a lot of really interesting people. And I can speak from my own experience, it's been
kind of the backbone of my own career, starting companies, raising funding, recruiting
employees, getting the message out, AES has always been there supporting me in everything I've
done.
And we have a lot of fun in these meetings too. So I encourage you to consider joining if you're
not already a member. Speak to any of us during the break or after the meeting if you're curious
about the organization and what it's all about. The Web site is on the -- our local Web site is that
URL on the sign there. If you just go to aes.org, that's the international Web site. You can read
more about it there too.
Let's see. If you didn't sign the little sheet in the back of the room when you came in, you might
want to do that. That gets you on our mailing list. It also makes you eligible for the door prizes
that go out tonight. And we've got some Microsoft software and some other goodies to give out.
And, Rick, you wanted to make a plea on behalf of Mr. Gudgel?
>> Rick: Yeah. MidNite Solar, manufacturer of off-the-grid energy management systems, is
looking for an engineer type who has embedded systems and hopefully DSP experience. So if
this is you or you know somebody who fits that, get in touch with me and I'll put you in touch
with Robin. Or you can grab -- you can go to Google and look for MidNite Solar. You'll find
that they're a startup.
>> Bob Moses: Cool. Also, just keep an eye on our Web site and the national Web site for other job openings; one of the things AES does is advertise job openings and help people out.
At this point we usually go around the room and introduce ourselves, just a quick what's your
name, what do you do with audio, what's your interests; just 50 words or less.
I'll start. I'm Bob Moses. I'm the chair of the local AES section. I work for a company with a
funny name called THAT Corporation designing semiconductors for audio. And I'll leave it at
that.
>> Ivan Tashev: Good evening. I'm Ivan Tashev. I'm the speaker tonight. Work here in
Microsoft Research doing audio software and member of the Pacific Northwest community.
>>: My name is Rob Bowman [phonetic]. I'm a product engineer. I work here doing headsets
and Web cam audio.
>>: I'm Travis Abel [phonetic]. I'm not a member yet. But I'm into composing and engineering
and drumming. So I'm kind of everything all together here.
>> Bob Moses: Welcome.
>>: Hi, I'm Trina Sellers. I'm a personal manager for Joey Marguerite [inaudible].
>> Joey Marguerite: I'm Joey Marguerite. I'm a recording artist and a jazz, soul/jazz singer and
songwriter. I have a production company called RooGal Productions. I've worked in radio and done some on-air talent and engineering and live engineering for a number of years as well.
>>: Rick Shannon [phonetic]. I'm the vice chair on the section. I'm the Webmaster, live sound
engineer, recording engineer, anything audio. No ladders, no lights, no video.
>>: Ryan Ruth [phonetic]. Right now recording engineer and live sound engineer.
>>: Gary Beebe with Broadcast Supply Worldwide in Tacoma, audio salesman.
>>: [inaudible] I work here in Microsoft Research [inaudible] speech signal processing
associated with acoustic [inaudible].
>>: I'm Ken Kalin [phonetic]. I'm recently laid off from a medical device company [inaudible]
for 18 years and now I'm looking at contracting at Physio-Control. I got an e-mail from Bob
Smith [inaudible] and sounded like an interesting thing because audio has always been a hobby.
>>: My name is Matt Rooney. I work here at Microsoft, the GM of mobile games at Microsoft
Game Studios, and before that, about 20 years ago, started on video games doing audio and DSP
for games.
>>: My name's Scott Mirins [phonetic]. I work at Motorola, pretty soon to be Motorola
Mobility, so in the cell phone division and do audio software and that type of thing [inaudible]
stuff like that with phones.
>>: My name is Peter Borsher [phonetic]. I work at Microsoft in the Manufacturing Test Group
[inaudible] hardware design engineer.
>>: I'm Dan Hinksbury [phonetic]. I work with Peter in manufacturing tests, most recently on
Kinect.
>>: Mike [inaudible]. I'm just a guest here checking it out.
>>: Christopher Overstreet, pianist, composer, and private researcher of just really interactive
systems, mapping gestures to other types of output mediums: 3D audio and video [inaudible].
>>: Brian Willoughby [phonetic]. I do software and firmware and hardware design. Currently
working on a multitouch controller for controlling audio [inaudible].
>>: Steve Wilkins. I've worked in psychophysical acoustics in a consulting firm. I'm a
musician. I have a studio and I have a bunch of stuff in the can if anybody wants weird sound
for games.
>>: Adam Croft, primarily film sound, but I've got a background in live [inaudible].
>>: And Greg Duckett. I'm the director of R&D engineering at Rane Corporation. We're a -- we design and manufacture professional audio equipment up in Mukilteo.
>>: I'm Dan Mortensen. I do live sound for concerts and on the local committee. And if you
want to shake the hand of somebody who worked with Miles Davis and Thelonius Monk and
Barbra Streisand, then at the intermission here you should come up and talk to...
>>: Frank Laico, recording engineer.
>>: Steve [phonetic], [inaudible] engineer, do certain board design, former chair of the
committee.
>>: Lou Comineck [phonetic]. I'm a broadcast video engineer. I do -- specialize in live
multicamera broadcasts of football games, baseball games, the Olympics. I work for the major
networks, ABC, ESPN, so on and so forth.
>>: My name is Chris. I'm an electrical engineer. I had a recording studio for a while, but now
I build robots right around the corner from you guys at Mukilteo. And in the evening I've been
writing songs [inaudible] I've got to stop because [inaudible].
>>: [inaudible]
>>: Chris Brockett. I'm with a Natural Language Processing Group here at Microsoft Research.
>> Bob Moses: Very cool. It's always a fun crowd of people. And do go talk to Frank during
intermission, because he discovered Miles Davis. And it's an amazing group of people in AES.
So tonight's guest speaker is Ivan Tashev. Ivan has his master's and Ph.D. degrees from the University of Sofia in Bulgaria. He works here at Microsoft, Microsoft Research, doing audio
and acoustics research. I'm not going to read all this off the Web page here. He was a primary
architect of the sound subsystem in the Kinect device. And I got to see a demo in his lab about a
month ago, and it was really, really cool.
The coolest thing is Dr. John Vanderkooy, who is the esteemed editor of the Audio Engineering
Society journal, one of the most revered scientists on the planet, was dancing to some hip-hop
music, and I got it on my high-definition here. So that was a real treat.
So thank you, Ivan, and thank you Microsoft for hosting us tonight and then giving us this really
interesting preview of what's going on in the Kinect device.
[applause]
>> Ivan Tashev: So good evening. Thanks for coming to this talk. Before I even start, I want to apologize, first of all, for my heavy accent. Here in Microsoft and even in Redmond, more than one half of us are not born in the U.S., so using this broken English is kind of the norm. But I have seen some puzzled faces when I spoke with people who are outside of Microsoft.
Second thing is that during the talk we will see a lot of mathematics and equations. If you don't want them, just skip them; try to get the gist of what is going on in the presentation of the sound, how computers see it and how they process the sound.
It's not intended to be a heavy digital signal processing course; it's just for those who have seen
those mathematics.
First, a couple of words about where we are. We're in Building 99 of Microsoft Corporation. This is the headquarters of Microsoft Research, the research wing of the corporation.
Microsoft Research was established in 1991. This is the year when the revenue of our company
exceeded $1 billion. From all of our sister companies with similar revenue, none of them created
their own research organization. Not one of them is alive today. We claim here in this building that Microsoft Research, with its 850 researchers, creates the stack of technologies which the company needs when it is necessary. We bring the agility this company needs to survive this very fast and very quickly changing world.
We don't do products here. We cannot say, okay, we ship to Windows or Office. What we provide is pieces of technologies, algorithms, approaches, which help to make Microsoft products better.
[inaudible] to continue and to speak, I will show you a 60-second video.
[video playing]
>> Ivan Tashev: So today we'll talk about the sound, we will see what we can do to remove the
unwanted signals from one single audio channel, what we can do if we have multiple
microphones, can we combine them in a proper way to remove the things we don't want, and
we'll talk about some basic algorithms which we use, and we'll end up with some applications.
And at the end I'll basically just let the Kinect device run with some of the games I brought with me and let you guys try it.
And at some point somewhere here we have a break, and most probably we will start to do some gaming, to basically get hands on the technology.
Now, sound capturing and processing. So the point of view from which we are going to look at this process is kind of different from how most of the other engineers do. For them, these are the microphones, and they go and record with the highest possible quality. They have the freedom of a professional setup where you can put the microphones in a specific way to make the recording sound better.
In computer world, and this computer world includes mobile phones, includes your personal
computer at home, your laptop, the Kinect device or anything else in your media room, the
speakerphone in your conference room, usually first we don't have the freedom to do the
professional setup of the microphones, and this requires a heavy processing of the audio signal,
so we can remove the noises, reverberation, et cetera, et cetera, and get some acceptable sound
quality.
And because in engineering nothing is free, this processing itself introduces certain distortions in the audio signal. So it's always a question of balance: with more processing we get rid of more of the unwanted signal, but we also introduce more distortion.
But in general, sound capturing and processing from this point of view has all of those three
categories. It's a science. We work with mathematical models, with statistical parameters of the
signals. There is a heavy mathematics in deriving those equations. And we have repeatable
results. We use the same input always and there is the same output. So this is the science side of it.
But in the end the consumer of this processed sound is the human, with his own ear and his own brain. And once we go to the human perception, this is already an art. Because it's very difficult to explain in equations what humans will or will not like after we do some heavy processing.
And of course it's a craft. As most people doing audio capturing and recording know, you always have some tricks and some stuff which you put in your processing algorithms so that it sounds a little bit better than the same algorithm from the competing lab.
Okay. So in computing we have mostly two major consumers of this captured sound. First is the
telecommunications. Starting from the mobile phone, desktop PCs, we run software like Skype,
like Microsoft Office Communicator, Windows Live Messenger, Google Chat, you name it.
And the second is speech recognition. This is something very specific for the computer world,
but speech recognition is an important component of the interfaces between the human and the machine.
And it has its own specifics, which we're not going to go deep here today.
So in general the discussion of those basic algorithms is pretty much the meat of the talk today. And we'll just show some aspects of building the end-to-end system. But the principle here is that a chain of optimal blocks, each one of them optimal in a certain sense, is usually not optimal overall; it is suboptimal. So you have to tweak and tune those blocks together, aligned as they are connected in the processing chain.
So let's talk about the sound and the sound capturing devices. From a physical point of view, the
sound is nothing else but moving compressions and rarefactions of the air. They usually move
straight and the sound has several characteristics. The first is the wavelengths, the distance
between -- the closest distance between two points with equal pressure.
And it has its own frequency: if you stand here, how many of those will pass this point in one second. These important properties, frequency and wavelength, are connected by a constant called the speed of sound, which in air is 342 meters per second at 20 degrees centigrade.
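(As a worked relation, not on a slide, just for reference: the wavelength is the speed of sound divided by the frequency.)

```latex
\lambda = \frac{c}{f}, \qquad
\lambda(100\,\mathrm{Hz}) = \frac{342\ \mathrm{m/s}}{100\ \mathrm{Hz}} \approx 3.4\ \mathrm{m}, \quad
\lambda(1\,\mathrm{kHz}) \approx 34\ \mathrm{cm}, \quad
\lambda(10\,\mathrm{kHz}) \approx 3.4\ \mathrm{cm}.
```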
This speed of sound is pretty much anything but constant. You change the atmospheric pressure, the speed of sound changes. You change the temperature, the speed of sound changes. You change the proportions of the gases in the atmosphere, the speed of sound changes.
So it varies and it goes down to 330 meters per second at the freezing point of zero centigrades.
But every single wave will have those parameters. So intensity is how much the atmospheric pressure changes, and usually it is measured on a logarithmic scale. So we have to have 0 dB, a reference. And the reference is selected to be 10 to the minus 12 watts per square meter: the amount of energy we get through one square meter is 10 to the minus 12 watts.
If we expose one square meter to sunlight, you get around 500 to 1,000 watts. So the
first conclusion here is that sound is extremely, extremely low-energy phenomenon.
Second, while intensity counts the power which goes through a given surface, we can also use the sound pressure level, which is measured directly in changes of the atmospheric pressure. And again, because it is a logarithmic scale, 20 micropascals of pressure is selected as a reference. This is the threshold of hearing of the human ear at around 1,000 hertz. So 0 dB is a sound at 1,000 hertz which we can barely hear.
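(In equation form, the two logarithmic measures just described, with the reference values quoted above, would be:)

```latex
L_I = 10 \log_{10}\frac{I}{I_0},\quad I_0 = 10^{-12}\ \mathrm{W/m^2};
\qquad
L_p = 20 \log_{10}\frac{p}{p_0},\quad p_0 = 20\ \mu\mathrm{Pa}.
```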
We're not going to discuss any aspects of human psychology, psychoacoustics and how humans
perceive this sound. That's pretty much a subject for a separate talk. But this is just to set the reference here.
So the propagation of the sound across the air is not free. We have energy losses. First, of course, with the distance, because the same energy spreads over a growing surface: if you increase the distance twice, you have four times less energy through the same surface. And that's the famous inverse square law.
Besides this, the compression and rarefaction are related to losses because of the shrinking and expansion of the gas. And this is pretty much the attenuation of the sound, which is a function of the frequency, in decibels per meter.
And what we can see here is that up to 20,000 hertz, which is as high as we can hear, it's negligible. For one meter it's 0.6 dB, and once we go one meter further, we lose much more because of the distance. So this is why in most of the mathematical estimations we can assume that the sound propagates in the atmosphere without losses. This is valid for this room. This is not valid if you try to do sound propagation over a couple of miles or some other range. But for most of the normal conditions when we use sound as an information medium, that's true.
And just a couple of words about another phenomenon specific to sound. This is that the sound, besides being a very low-energy phenomenon, is very broadband. So humans are considered to be able to hear between 20 and 20,000 hertz. This is a 1,000 times difference in frequency. For comparison, in visible light, the low-frequency red light has a wavelength around 720 nanometers, and the deep blue is around 340. Which means all the colors we can see actually span less than a two times difference in wavelength. And, on the other hand, in sound we have a 1,000 times difference in frequencies.
And there is one more thing. The wavelength of the visible light is measured in nanometers. This is 10 to the minus 9th, which means that every single object around is way, way, way bigger than the wavelength. So the light propagates in straight lines. We have perfect shadows. You can see my shadow. And this is definitely not the case for audio. The audio is surrounding us. Objects are comparable with its wavelength.
One hundred hertz is 3.4 meters. 1,000 hertz is 34 centimeters. 10,000 hertz is 3.4 centimeters. So not only do we have objects which are comparable with the wavelength, but simultaneously, in the same sound we use, we will see sound which behaves pretty much like visible light. For 10,000 hertz, this thing will cast a clear shadow, and a 10,000-hertz sound source will not be hearable here.
On the other hand, a hundred hertz will go around this desk without any problems. So we face all of those diffraction effects during the transmission of the same sound, because our environment has objects of sizes comparable to the wavelength; in some cases they will behave pretty much as they do with light, and in some cases they are acoustically transparent.
And of course you can see the intervals of frequencies different animals can hear. So human hearing is between 20 and 20,000. And you can see that we are actually pretty good listeners. Better than us are -- the common house spider, I think, is what is missing here.
An interesting case is right here. The bat can hear between a thousand and a hundred thousand hertz, and emits signals at 40-, 45,000 hertz. And this is how the animal basically can do the localization and detection of the objects in front of it.
And the other case is the whales. They can do this too, and they have this huge hearing range in water. They have way, way worse hearing in air. And the reason for this is that it is very difficult to match the acoustical impedance of the air and the water.
And just for information, what it means for a sound to be loud: 0 dB is the threshold of hearing.
You drop a pin, it's around 10 dB. In a quiet library, you have around 40 dB. That's a very quiet
sound. Here our sound pressure level when I do not speak is around 50 dB SPL. And it goes
higher and higher. Then a machinery shop is around 100 dB, which is close to the
threshold of pain and it's extremely unhealthy, and it goes up to 170 dB for the Space Shuttle
launchpad.
This is for the set of frequencies we can hear. Otherwise, for those which we cannot -- let's say low frequencies -- I can give you an example. A passing cloud above us generates a sound of 170 to 190 dB SPL. But the frequency is way, way below; at one hertz it is more changes in the atmospheric pressure, and it's a subject of meteorology, not a subject of acoustics.
So how are we going to get the sound -- those microscopic changes in the atmospheric pressure -- and convert it to an electrical signal? That is the microphones. All of the microphones use some physical effect to convert small movements of a diaphragm back and forth into an electrical signal. This can be a crystal, which is based on the piezoelectric effect: we have two small pairs of crystals connected to the diaphragm, and as it moves back and forth because of the changes of the atmospheric pressure, we have an electrical signal.
This can be a dynamic, which is pretty much an inverted loudspeaker: we have a coil which moves inside, driven by the diaphragm. And a condenser, which is pretty much two plates; one of them moves and this changes the capacitance, and somehow we convert this into an electrical signal.
So we're not going to go in detail how the movements of the diaphragm are converted to an
electrical signal, but what matters here is actually that we have to have the diaphragm. So pretty
much the microphone is nothing else but a closed capsule with a diaphragm. Closed capsule
means the pressure here is constant. And when it changes on the outside, the diaphragm starts to
bend back and forth.
And if the size of this microphone is smaller than the wavelength of the sound we want to capture, then we can say that we capture the changes in the atmospheric pressure at a point in space. So this is an acoustical monopole. On top of this, this microphone pretty much reacts to the sound in the same way regardless of the direction it came from, because those are changes in the atmospheric pressure.
Completely different case we have if we have a small pipe and the diaphragm in the middle. So
this is called acoustical dipole and has a very specific behavior when the sound comes from
different direction.
So if we have a sound coming from 90 degrees, the soundwave, the changes in the pressure, comes here, and we have the same pressure on both sides of the diaphragm. It doesn't move.
If the sound comes along the axis of this pipe, first it will hit the front, and with some small delay it will hit the back. This means that we'll have a difference, and it will react. So this is why the directivity pattern of this dipole is a figure eight. At plus/minus 90 degrees we don't have sensitivity: even if we have a loud sound, it is not registered. But if it comes along the axis, the axis of the microphone, we have maximum sensitivity.
Unfortunately, this microphone doesn't do anything else but measure the difference in the atmospheric pressure here and here, so pretty much it takes the first derivative of the atmospheric pressure.
As a result, we have a not very pleasant frequency response. If the signal is at a very low frequency, pretty much the same thing happens on both sides and our microphone is not very sensitive.
Once we start to have a higher frequency, the sensitivity goes up. But, still, this pressure gradient microphone is an important component of microphones. And so far we assume that the microphone is in a so-called far field sound field, which means the sound is somewhere very, very far away and the magnitudes on both sides of the microphone are the same.
If we have a sound source which is closer, what happens is that we have a certain sound path here, and the sound has to travel this distance, which means that we already have a comparable difference in the sound pressure level here and here just because of the difference of the sound paths.
And then the microphone starts to behave a little bit differently. Now, of course we don't want that 6 dB-per-octave slope towards the higher frequencies, so we have to compensate somehow. And in a simple way, we can compensate by adding one single RC group which has the opposite frequency response, so we can have a flat frequency response.
And if you compensate for far field, as you can see in pretty much the entire range, we'll get our
flat frequency response. So let me reiterate this. If you will turn back here, this frequency
response compensated with one RC group, we can flatten it, and then for far field sounds, we'll
have the green line here.
But then when we have a sound source which is closer, so we grab the microphone, let's say from this distance, and I start to approach it, then the effect of the difference will actually -- what's this? This is not me. Anyway, it will basically have the effect that we did overcompensation.
And the basses --
>>: It could be somebody outside wanting in.
>> Ivan Tashev: Doesn't matter. So what happens is that for a closer distance, let's say 2 centimeters, when the microphone is a close-up microphone, we will have overcompensation. Okay. If we have compensated for a microphone which is close to our mouths, meaning we straighten this line, then for far field sounds we will have the basses basically going down, as we call it.
So what happens when singers hold the microphone, which is usually compensated for 20 centimeters or so, and they do this: you see how the basses get basically enforced, and you hear a way, way different voice.
Or if you have a certain microphone, a directional microphone, on your headset right here, then suddenly all the sounds, especially in the lower part where the noises are, start to disappear. So this is called a noise canceling microphone. It's nothing else but a properly compensated figure eight microphone.
And I think this is called the proximity effect. Singers have actually used this: by changing the distance to the microphone, what happens is they move the frequency response up and down. It's kind of an on-the-fly control of the basses.
So this is how the microphones look inside. This is not a microphone; this is a microphone capsule inside an enclosure. And inside usually we have something like this. This is an omnidirectional microphone: the capsule closed with the diaphragm in front. You can see some small, small holes there, but those are just to compensate for the changes of the atmospheric pressure. And the directional microphones already have well, well visible holes on the back. So it's a kind of small pipe, a small pipe with the diaphragm in front.
In general, a combination between the directivity pattern of an omnidirectional microphone and a figure eight microphone in a certain proportion alpha, which is between 0 and 1, can bring us a large variety of directivity patterns for the microphones. And this is where we have our cardioid, supercardioid, hypercardioid, and figure eight microphones.
So if this alpha, which is what portion of the omnidirectional microphone we have, is equal to 0 -- let me turn back to show you the equation. So we're talking about this alpha. If this alpha is 1, the directivity pattern is constant; this is the omnidirectional microphone. If this alpha goes down to 0, this term is 0 and we just have the cosine of the angle; this is the figure eight directivity pattern. And with something in between we can reach those different directivity patterns.
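(The equation on the slide is not reproduced in the transcript; the standard first-order directivity pattern that matches this description would be:)

```latex
B(\theta) = \alpha + (1-\alpha)\cos\theta,
\qquad
\alpha = 1\ \text{(omnidirectional)},\quad
\alpha \approx 0.5\ \text{(cardioid)},\quad
\alpha = 0\ \text{(figure eight)}.
```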
Each one of them has a name because of a certain specific property. The omnidirectional microphone is well known because it captures the sound from everywhere. The cardioid microphone is known to completely suppress the sounds coming from the opposite direction. The supercardioid microphone has the highest front-back ratio, which means the ratio between the energy it captures from the front half and the energy it captures from the back half is the highest. This makes it a very good microphone to be placed on the edge of the theatrical stage, to capture the artists and not the coughing of the public.
The hypercardioid microphone is known to have the highest directivity index, which means it suppresses the maximum amount of noise coming from all directions. And there used to be times when those directional microphones actually had two microphones in the enclosure; you get the two signals separately, and you can actually change the directivity pattern of the microphone on the fly. That's a long, long time ago. Currently we just have capsules which have a specific directivity pattern.
Of course those were theoretical derivations of the directivity patterns. Once we start to deal with real microphones, we start to see some different things. For example, this is a catalog directivity pattern of one of the microphones -- the directivity of a cardioid microphone.
And you see the manufacturer says, okay, it's flat, and down to around 300 hertz it starts to go down. At 180 degrees we have pretty much 15 dB separation. So the cardioid microphone doesn't go to 0 in the back, but 15 dB is actually pretty decent directivity.
When we get this microphone and actually measure it in an anechoic chamber, what happens is that first we see that it's not quite uniform. For certain frequencies it has a tail, so it's a hypercardioid; for others it's a pure cardioid, et cetera. So it is not uniform across the frequencies. And at 0 degrees the frequency response is not quite flat; we have some dB going up and down.
But regardless of this, it is a relatively good match between what the specification says and the parameters the actual microphone, measured in the anechoic chamber, shows us.
And everything we discussed so far was about having a microphone freely hanging in the air, because we can assume that, hanging in the air, the capsule is very small compared to the wavelength we want to capture.
Unfortunately this is not the case when we have the microphones in devices: telephones, speakerphones, or something like the Kinect device. Then things get complex. And the first thing which happens is that we start to see quite an asymmetric directivity pattern. What you actually see here is the directivity pattern of one of the four microphones we have in the Kinect device.
Usually having a microphone around -- placed in an enclosure makes the directivity patterns
worse. The directivity and how much noise we can suppress is going down.
Kinect is one very pleasant exception because in the design and the position and placement of the
microphones we actually use the shape of the enclosure to form the directivity patterns of the
microphones, and actually increase the directivity index, increase the directivity of the
microphones. So we can basically suppress more noise in advance even before we start to do
any digital signal processing.
Okay. But in general to make a simple, short summary, a cardioid microphone has 4.8 dB noise
reduction compared with a single omnidirectional microphone. So we are already 4.8 dB ahead
in suppressing noises and removing the unwanted signals.
So it's highly recommended to use directional microphones if we know and can basically guess
where the speaker, where the sound source, is coming from.
In devices like Kinect it's very obvious because the users are in front. And we do not expect
to have sound sources behind. And if there are, we actually don't want them. Those are usually
reflections from the wall behind.
So this is why all four microphones in Kinect point straight forward and try to suppress the
noises coming from left, right, and back. There's speakers, there are reflections. We don't want
them.
Okay. Now we're ready. Our sound has been converted by the microphone to an electrical signal.
Now let's see what we are dealing with. We'll talk about the noises in the speech signal. We'll
see how computers actually see the sound.
First, noise. What's noise? Let me play some noise.
[audio playing]
>> Ivan Tashev: This unpleasant noise is called white noise because it covers evenly all
frequencies which we can hear, from 20 to 20,000 hertz. It is a mathematical abstraction. We cannot find it in nature. In nature we can find this.
[audio playing]
>> Ivan Tashev: This is inside of a passenger plane, a noisy place in general. But this noise has a different sound because it has a lot more low frequencies, and the frequencies toward the upper part go down.
And of course we can have another unpleasant noise.
[audio playing]
>> Ivan Tashev: Which contains many, many, many, many humans talking together.
Besides the spectrum, the frequencies those noises contain, statistically speaking they are actually quite similar. All of them -- their magnitudes -- can be described with a Gaussian distribution, which is the simplest statistical model, so we can start to play with them.
Of course the white noise has a flat frequency, flat spectrum. Well, the other noises pretty much
go down towards the high frequencies and have a higher magnitude towards the lower part of the
spectrum.
So skip this. Speech. Speech happens to be different in many, many different aspects. It has nothing to do with and is not like the noise. The first thing is that if you try to do a statistical model of the distribution of the magnitudes, it does not look like a Gaussian distribution. This is the blue line. It's way peakier. The speech is mostly around 0, with occasional sharp, basically big magnitudes. So a Gaussian is not quite a good statistical model.
On top of this, the speech signal has parts with completely different statistical parameters. We have vowels, and those, as we will see, have a very good harmonic structure, and we have unvoiced fricatives, the sh and ps sounds, and those are kind of noise-like.
And speech is not constant like the noise; pauses are an integral part of the speech signal, and it is chopped into parts with completely different parameters.
And unlike the noise, which, usually after some reverberation here or there, is kind of spread around, speech sources are usually point sources which we can point to and get gains from.
On top of this, so we discussed that it is not kind of Gaussian, and those are three representations
of three different speech signals. First let me just comment what this means. This is time in
seconds and this is frequency, up 0, down is 8,000 hertz. And this is the scale, the magnitude.
Reddish color means higher magnitude. So what we can see is that when you say sh, we have mostly energies around 3-, 4-, 5,000 hertz, and not much below 1,000 hertz.
On the other hand, it is a -- it's kind of [inaudible] impulse signal. We have a frequency and
some energy here, and the rest is pretty much low magnitude.
And the last one is just a vowel. What we can see is that this signal has a very good harmonic structure. It has a main frequency, which is called the pitch. This is the speed of vibration of our vocal cords. They vibrate kind of like a hammer basically hitting a surface, which means besides the main frequency we have all the harmonics: two times, three times, four times, five times, up to 20 times the frequency. And then this chain of impulses goes through our mouth, where with muscles we change the shape and form this envelope.
And those maximums in the envelope are called formant frequencies. So from now what we can
say is that, okay, we have a maximum around 600 hertz here and a maximum around 2,000 hertz
here. Take a generator of those pulses, play through a speaker, and you will [making noise],
because based on those formant frequencies, we recognize which vowel is pronounced.
Humans have a different pitch. Usually males, they have a lower pitch. Females, they have a
higher pitch. But this shape, the formant frequencies are the same. And this is why [making
noise] is [making noise] regardless of who is saying it.
Okay. So what can we do with the speech signal? We can do noise suppression. Why? Because we saw that the speech signal is quite different from the noise signals. So there is a way to distinguish them and to suppress the noise.
Noise cancellation is something different. Noise cancellation is removing a noise we know.
Classic example is in the car we place a second microphone in the engine compartment. And
this is the noise of the engine. We can try to find how much of it goes to the microphone which
stays in front of the driver and to subtract it. So this is noise cancellation.
More common actually is those noise cancellation headphones in the planes when we have small
microphones on the outside of the ear cups so we can estimate how much of the plane noise
comes inside the cups and subtract it.
In general, all of this class of processing algorithms, which we use to make humans perceive the speech signals better, is called speech enhancement.
And one more processing which we're not going to talk about is active noise cancellation. So
this is what happens in those noise canceling headsets. Nothing stops us -- we have a lot of loudspeakers here -- if we have some low frequency noise going around, from making the loudspeakers send the opposite signal to this low frequency noise. And suddenly we can have a quieter room.
That works up to a certain frequency range, and it's kind of tricky. But it can be used. So we're going
to talk mostly about speech enhancement and we're not going to touch noise canceling
headphones, we're not going to talk about removing the noise from -- let's say from this room.
So once the changes in the atmospheric pressure become an electrical signal, the first process which happens with them is so-called discretization. Computers work with numbers. And this simply means, ignoring this whole math here, that we're going to sample -- to measure the magnitude of the signal -- at certain intervals. Those intervals are called the sampling period. Or, inverted, this gives us the sampling frequency.
And there are several people who left their names here, but one of the most famous is Claude Shannon, who basically said that we should do this sampling with at least two times higher frequency than the highest frequency in the signal we want to sample.
So technically this means that for humans, who can hear up to 20 kilohertz, we should do this with a 40 kilohertz sampling rate. The standard is 44.1 kilohertz, and this is what we use in most of the recording equipment, at least. Professional equipment goes to 48, and we have even higher rates already in computers.
In telephony we actually go down, because sampling the signal more frequently means more numbers to crunch and to process, and because the speech signal goes up to 6, 7 kilohertz, a 16 kilohertz sampling rate is usually considered enough for the speech signal, for speech communication.
Okay. Second major thing which happens with the speech signal once it is sampled is
quantization. So we get our number, the value of the atmospheric pressure or the voltage, but
that thing is basically -- has to be converted to a number. And the numbers in computers tend to
be discrete. So it can be 1, 2, 3, 4, 5, but nothing in between.
And this is why this process of quantization is converting the analog signal, analog value, to a
discrete value. And from the moment we did the sampling and quantization, we now have a
string of numbers. Pretty much this is what computers work with. We have those numbers with
certain frequency which arrive with reduced precision because of the quantization and we don't
know what happens in between, but the sampling theorem of Claude Shannon tells us that it is
okay because we did the sampling with at least two times higher frequency than the highest
frequency we were interested in.
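(A minimal sketch of sampling and quantization in code; the test tone, the 16 kHz rate, and the 16-bit resolution are illustrative choices, not anything from the talk:)

```python
import numpy as np

fs = 16000          # sampling rate in hertz (speech-grade, as discussed)
f0 = 440.0          # an arbitrary test tone
bits = 16           # quantizer resolution

# Sampling: measure the magnitude of the signal every 1/fs seconds.
t = np.arange(0, 0.02, 1.0 / fs)          # 20 milliseconds of samples
x = 0.5 * np.sin(2 * np.pi * f0 * t)      # the "analog" values, in [-1, 1]

# Quantization: round each sample to the nearest of 2**bits discrete levels.
levels = 2 ** (bits - 1)
x_q = np.round(x * levels) / levels

print(len(t), "samples, max quantization error:", np.max(np.abs(x - x_q)))
```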
Now, next thing: computers also kind of don't deal quite well with a stream of numbers. So the next thing that happens is we drop those numbers into packets, and those packets are called audio frames. The size of audio frames is between 80 and 1,000 samples, which typically means between 5 and 25 milliseconds. So we take this piece and then we try to process it together.
The next thing which happens is that this processing happens in the so-called frequency domain. We change the representation of the signals. If we take 1,024 samples, as we sample them from the microphone, this is a function of time. If we use the so-called Fourier transformation, we'll get again 1,024 samples, but they represent the signal in the frequency domain. So for this particular frame, we know at which frequencies we have higher magnitudes or lower magnitudes.
And because this signal goes as a stream, what actually happens is those frames are 50 percent overlapped. You can see here: this is one frame. We convert it to the frequency domain, do whatever we want here, this is the output, and then it is converted back to the time domain. But the next frame is moved 50 percent forward.
And then they are properly weighted so we can combine them in a way that the signal has no breaks and is properly aligned.
So from now on, everything that happens -- and we will be discussing what happens here -- our input for our algorithms is one vector of 256 or 1,024 samples, the spectrum of the signal in the frequency domain; then we do some stuff here, and the rest is the so-called overlap-add process. This is pretty much the standard audio framework for all audio processing algorithms.
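(A minimal sketch of that standard framework: windowed frames, 50 percent overlap, FFT, per-bin processing, inverse FFT, and overlap-add. The frame size and the pass-through process() are placeholders, not the actual Kinect pipeline:)

```python
import numpy as np

def process(spectrum):
    # Placeholder for the per-bin processing (suppression gains, beam weights, ...).
    return spectrum

def stft_overlap_add(x, frame=512):
    hop = frame // 2                                             # 50 percent overlap
    n = np.arange(frame)
    win = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * n / frame))     # sqrt-Hann: analysis * synthesis sums to 1
    y = np.zeros(len(x))
    for start in range(0, len(x) - frame + 1, hop):
        spec = np.fft.rfft(win * x[start:start + frame])         # frame -> frequency domain
        spec = process(spec)                                     # do whatever we want here
        y[start:start + frame] += win * np.fft.irfft(spec, frame)  # back to time and overlap-add
    return y
```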
Okay. Now, what we can do and how we can remove the noise. I'll try to increase the speed here because we will see a lot of equations. But roughly what happens is that we get this vector, which is the representation in the frequency domain of the current audio frame. And that's a mixture between the speech signal and the noise signal.
So what we can do is, if we know how much noise we have -- let's say we get a signal and we know it's 15 percent noise -- we can apply a number, multiply by 0.85, and presumably we'll have just the speech signal left. And if we can do this for each frequency bin -- k here is the frequency bin -- with this real gain, which will be different for every single frame, if we can somehow compute this thing, then the output will have less noise, because at least we remove an equal portion of the noise.
So pretty much this is what happens. We'll heavily use the fact that the speech comes with pauses in between. And we'll use something called a voice activity detector. When the voice activity detector tells us there is no speech, it's just noise, we'll update the noise model. This is our statistical parameter, one per frequency bin, which simply tells us the noise energy for this particular frequency.
And then pretty much on every single frame we'll try to compute this suppression rule, which is that number, usually lower than 1, which we can apply and then convert back to the time domain. So this is pretty much what a noise suppresser does. We're going to skip a substantial amount here.
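(A very rough sketch of the loop just described: an energy-based voice activity decision, a per-bin noise model updated only during pauses, and a real-valued gain per bin. The smoothing constant, the crude VAD threshold, and the Wiener-style gain are illustrative choices, not the actual Kinect suppression rule:)

```python
import numpy as np

class NoiseSuppresser:
    def __init__(self, bins, alpha=0.95, floor=0.1, warmup=10):
        self.noise = np.full(bins, 1e-6)   # per-bin noise power model
        self.alpha = alpha                 # smoothing constant for the noise model update
        self.floor = floor                 # minimum gain, to limit speech distortion
        self.warmup = warmup               # first frames are assumed to be noise only
        self.frame = 0

    def process(self, spec):
        power = np.abs(spec) ** 2
        # Crude VAD: frame energy close to the noise model energy -> treat as a pause.
        is_pause = self.frame < self.warmup or power.sum() < 2.0 * self.noise.sum()
        if is_pause:
            self.noise = self.alpha * self.noise + (1 - self.alpha) * power
        self.frame += 1
        # Per-bin real gain, usually lower than 1: keep the estimated speech fraction.
        snr = np.maximum(power / self.noise - 1.0, 0.0)
        gain = np.maximum(snr / (1.0 + snr), self.floor)
        return gain * spec
```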
Pretty much the suppression rule is kind of complex, and it's not a function just of the current noise; it's a function of the a priori and the a posteriori signal-to-noise ratio. It is something which a lot of smart people worked on for a long time. But this is already considered a classic, a classic digital signal processing algorithm.
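(One classic rule of this family -- not necessarily the one shipped in the product -- is the Wiener gain driven by the decision-directed a priori SNR estimate:)

```latex
\gamma_k = \frac{|X_k|^2}{\lambda_{N,k}}, \qquad
\hat{\xi}_k = \beta\,\frac{|\hat{S}_k^{\text{prev}}|^2}{\lambda_{N,k}}
           + (1-\beta)\max(\gamma_k - 1,\,0), \qquad
H_k = \frac{\hat{\xi}_k}{1+\hat{\xi}_k}, \qquad
\hat{S}_k = H_k X_k,
```

where lambda_N,k is the noise model for bin k, gamma_k is the a posteriori and xi_k the a priori signal-to-noise ratio, and beta is a smoothing constant close to 1.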
And a noise suppresser -- okay, we skip this. A noise suppresser is something which is already not considered something we should spend research time on; you just take that algorithm and use it. And it actually improves a lot. The signals on the output of the noise suppresser sound nice. The noise is gone.
I have to underline here something which may be surprising. If you do very serious testing of understandability -- meaning you have a speech signal plus noise and ask people to do a transcription, and then you do state-of-the-art noise suppression and ask another group of people to do a transcription -- the percentage of errors will be pretty much the same. Nothing can beat the 100 billion neurons between our ears.
Okay. But the difference is that those who listen to the noise-suppressed, processed signal perceive it as better because it loads the brain less. So it reduces the fatigue and, in general, it's considered more friendly to the listener. Instead of us spending our brain power to remove the noise, the computer does this for us.
>>: So what technique of noise suppression did you do for Kinect?
>> Ivan Tashev: We'll see a block diagram, partial block diagram what happens. But, yes, we
have a noise suppresser at some point after the so-called microphone array.
So, now, microphone arrays: what do they do? Usually I compare sound processing with a single microphone to trying to do image processing, or to make a picture, with a camera which has one pixel. This is pretty much it: we sample a pretty complex three-dimensional propagation of the soundwave in one single point.
We don't know where the signal came from. If we place more than one microphone in a certain mutual position which we know, then we can do more interesting stuff. And this is called a microphone array.
We'll say that multiple microphones become a microphone array when we do the processing and try to combine the signals together. This is not a microphone array; this is a stereo microphone. We don't do any processing. We take the two signals and just record them. That's it. We leave the brain to do the processing.
Let's see what we can do with the microphone arrays. So you see if we have -- we have here
multiple microphones. In this case we have four. Here we have eight. In a circle geometry, they
can be placed into a car. We have four microphones here. And having those multiple
microphones allows us to sense where the sound came from.
Why? Well, it's simple. Here is the microphone array. The sound comes from here. It will reach first one of the microphones, then the second, then the third, then the fourth. If the sound came from here, the order would be the opposite. The sound moves pretty slowly from a computer standpoint. This is well, well detectable. The difference between here and here is 7-8 samples at a 16 kilohertz sampling rate, which allows us to have a pretty decent sense of where the sound came from.
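(A minimal two-microphone sketch of that idea: estimate the sample delay by cross-correlation and convert it to an arrival angle. The 16 cm spacing, the 16 kHz rate, and the plain, non-generalized cross-correlation are assumptions for illustration, not the Kinect geometry:)

```python
import numpy as np

def doa_from_pair(x1, x2, fs=16000, spacing=0.16, c=342.0):
    """Estimate the direction of arrival, in degrees from broadside, from two mic signals."""
    corr = np.correlate(x1, x2, mode="full")            # plain cross-correlation
    lag = np.argmax(corr) - (len(x2) - 1)               # delay of x1 relative to x2, in samples
    tau = lag / fs                                      # the same delay in seconds
    sin_theta = np.clip(tau * c / spacing, -1.0, 1.0)   # plane wave: tau = spacing * sin(theta) / c
    return np.degrees(np.arcsin(sin_theta))
```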
The next thing is to know which direction we want to capture. Once we know this, what we can do is the following. For example, if I want to capture Rob's voice, I know the delays of the sound coming from Rob's direction, and then what I can do in the processing phase is to delay these signals: the first microphone to the last one, the second microphone to the last one, the third microphone to the last one.
Once I sum those signals, all signals coming from Rob's direction are in phase and we will have constructive interference.
However, if we have a sound coming from this direction, even after this shifting, the signals are at different phases. And we'll have destructive interference and the level will go down.
So with this simple delay-and-sum processing, I start to have a better directivity towards this direction and start to suppress the sounds coming from other directions. And because, if you remember, the noise is kind of spread around us, but usually the speaker is a point source, the beam separates out the other signals, which are mostly noise, or Rob's voice reflected from the ceiling and from the walls, which is reverberation and something we don't like as well. And now we have a better signal on the output of this microphone array.
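(A sketch of delay-and-sum for one multichannel frame, done in the frequency domain: phase-align each channel toward the chosen direction and average. The uniform 4 cm linear spacing is an assumption for the example; the actual Kinect layout may differ:)

```python
import numpy as np

def delay_and_sum(frames, theta_deg, fs=16000, spacing=0.04, c=342.0):
    """frames: (num_mics, frame_len) time-domain frame; returns the beamformed frame."""
    num_mics, frame_len = frames.shape
    theta = np.radians(theta_deg)
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    spectra = np.fft.rfft(frames, axis=1)
    out = np.zeros(freqs.shape, dtype=complex)
    for m in range(num_mics):
        tau = m * spacing * np.sin(theta) / c                 # delay of mic m for a plane wave
        out += spectra[m] * np.exp(2j * np.pi * freqs * tau)  # undo the delay (phase align)
        # signals from theta add in phase (constructive); other directions partially cancel
    return np.fft.irfft(out / num_mics, frame_len)
```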
So this very simple algorithm, as we can see, has a not quite good directivity pattern. This is direction, and this is frequency. For lower frequencies we have pretty much no directivity. Towards the design direction I have a flat frequency response; I'll capture his voice nice and clear. But for certain frequencies I'll have some areas with [inaudible] sensitivity. But in general it's a high-directivity microphone. Nothing more, nothing less.
And now I will show you this in action in a quick demo. Okay, one, two, three, testing. So we
can see that the device can sense when I'm talking, one, two, three, testing, testing, testing. You
see that line is the sound source localizer, which localizes where the point source is and tries to listen towards that direction, suppressing the noises coming from other directions. One,
two, three, testing, testing. One, two, three. You can see one, two, three, testing, testing.
And we can do something like this. Okay. One, two, three, testing, testing. So I'm let's say
three or four meters away from the device. One, two, three, testing, testing, one, two, three,
testing, one, two, three, testing, testing.
[audio playing]
>> Ivan Tashev: So what is the effect? It shortens the distance. If I have a high directivity, I will capture less noise and less reverberation. And from four meters I will sound the same way as at a closer distance using just one regular microphone. That's pretty much it. It reduces the noise, reduces the reverberation, and shortens the distance, the perceivable distance,
between the speaker and the microphone. Technically this means better quality. Because close
up to the microphone in general means less noise and less reverberation.
One more. Let me see how this will work. So what is going to happen here is I'm going to
record in parallel. Okay. So first I'll mute the microphones, the PA. And then I'll try to record
in parallel, in parallel a signal from one microphone, which is on my laptop, and with the
microphone array. And you will see the difference.
One, two, three, testing, testing, testing.
[audio playing]
>> Ivan Tashev: This is the single microphone in front of me.
[audio playing]
>> Ivan Tashev: And this is the output of the microphone array. So first you see the noise floor is gone. Second, I'll play this again, but you can hear how hollow, how much more reverberant the signal is here. And this sounds like I'm closer. And the distance was absolutely the same. Let's hear this
again.
[audio playing]
>> Ivan Tashev: And if you look at the measured signal-to-noise ratio here, we started at 14 dB
signal-to-noise ratio. This is measured on one of the microphones. And on the output we had 36
dB signal-to-noise ratio, or we did an impressive 22 dB suppression of the noise without hurting
actually much the voice signal.
>>: Now, was there any con- -- not conversation, was there any problem with the low end or
anything? I mean, it sounded great, but what is the -- what's the ramification on the low end?
>> Ivan Tashev: So on the low end, if you mean the lower part of the frequency band, yes. This size of microphone array is not quite efficient for a hundred hertz. For speech you actually care about 200 to 7,000 hertz, which is pretty much sufficient for anything, for telecommunication, for speech recognition. So that is pretty much the low end. Below 200 hertz, just cut it. Don't process it. Don't waste your time. Just remove it.
>>: But it'd have to be twice as big to go to [inaudible]?
>> Ivan Tashev: Yes.
>>: Okay.
>> Ivan Tashev: So technically if you go into the research labs which do audio research, you
will see 20- to 36-element microphone arrays, three or four meters long, et cetera, et cetera. But
Microsoft Research is the research part of a commercial company. And, yes, we do basic research. We enjoy those algorithms. But when the time comes to do some prototyping, we always ask ourselves, hmm, if we will ever want to make this a product, is this size of array reasonable, or should we stick with something smaller.
Okay, where were we. So just a couple of terminology points here about the microphone arrays. This process of combining the signals from the four microphones is called beamforming, because it forms kind of a listening beam towards a given direction -- that is what the green line follows. The algorithm I described, delay-and-sum, is the simplest, the most intuitive, and of course the least efficient algorithm.
By changing the way we mix those four signals, we can actually make the microphone array listen to different directions without any moving parts. It's the same as the sound operator who holds those big directional microphones during the making of movies and moves them from sound source to sound source; we can make the microphone array do this electronically. So this process of changing the beam, the listening direction, is called beamsteering.
We can do even more. We can do so-called nullforming, which means I want to capture you but I don't want to capture Rob, who talks in parallel, so I can mix the microphones in a way to have a beam towards this direction and a null towards that direction, so all the direct path from Rob's voice will be suppressed. Of course there will be some reverberation in the room, and I'll capture a portion of it, but most of the energy will be gone.
And of course we can do the same thing with null steering. For example, I want to capture your voice, but Rob is talking and moving back and forth. I can steer that null to suppress him constantly.
And of course sound source localization is an important part of every microphone array processing, because before pointing the beam we have to know the direction. So with this, as with any microphone array, we have the ability to detect where the sound came from. And actually quite precisely: we're talking about a couple of degrees here. At four meters that is about the size of my mouth. So pretty much I can pinpoint the beam to the mouth of the speaker.
>>: So you could take that into that New York cafe and hone in on somebody's conversation and
you could record all four channels and do it later on however you wanted, right?
>> Ivan Tashev: That will be more difficult to do with four channels, but let's say that we can design a microphone array which can do this. Somebody sent a link to a gigantic microphone array on top of a basketball field with 512 microphones, and with this you can go and listen to what the coach tells the players, you can listen to some conversations between the attendees of the game, and that works quite well.
>>: Takes a lot more microphones to do that well.
>> Ivan Tashev: Yes. Because --
>>: Sounds like you could do pretty good with four.
>> Ivan Tashev: Yeah, but those -- this is -- the distance is up to three or four meters. Beyond
that the reverberation already is too much. And you need a larger array and more microphones
and more processing power.
>>: You need a minimum of four microphones to make this work, or can it be done --
>> Ivan Tashev: Yes. Trust me, if we could do it with three, we'd do it. Every cent is counted in
the cost of goods of the device.
Okay. We will just see two types of beamformers. One of them is called the time-invariant beamformer. This means those mixing coefficients, for each frequency bin and for each channel, we precompute offline for beams every five degrees and store them in the computer memory. And then when the sound source localizer tells us 42.3 degrees, we take the weights, the coefficients, for the beam at 40 degrees, and because five degrees is roughly the beam width, we can pretty much cover the entire space without doing any computation, any serious computation, in real time.
This works if you assume that the noise is isotropic -- spread around us with equal probability
from each direction. But we cannot adapt on the fly, no null steering, when we have a disturbing
sound source.
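A toy sketch of that lookup idea: weights precomputed for beams every five degrees, then simply indexed by the localizer's estimate at run time. The table below is filled with random placeholders rather than real coefficients, and the shapes are assumptions for illustration.

```python
import numpy as np

BEAM_STEP_DEG = 5
N_BINS, N_MICS = 257, 4
beam_angles = np.arange(-90, 91, BEAM_STEP_DEG)
rng = np.random.default_rng(0)
# Offline: one complex weight per (beam, frequency bin, microphone) -- placeholders here
weight_table = (rng.standard_normal((len(beam_angles), N_BINS, N_MICS))
                + 1j * rng.standard_normal((len(beam_angles), N_BINS, N_MICS)))

def beamform_frame(X, doa_deg):
    """X: (N_BINS, N_MICS) spectrum of one frame; pick the nearest stored beam."""
    idx = int(np.argmin(np.abs(beam_angles - doa_deg)))   # e.g. 42.3 -> 40 degrees
    return np.sum(weight_table[idx] * X, axis=1)          # weighted sum, no adaptation

# Usage: the localizer reports 42.3 degrees for this frame
frame = rng.standard_normal((N_BINS, N_MICS)) + 1j * rng.standard_normal((N_BINS, N_MICS))
out_spectrum = beamform_frame(frame, 42.3)
```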
For adaptive beamforming I will actually turn a little bit to the slide. After a lot of [inaudible]
we came up with this formula for the weights, the mixing coefficients. The adaptive beamformer
is absolutely the same; the only difference is that those noise models are updated on the fly. At
every single frame we recompute the best mixing matrix, and this is why, when we have
somebody talking whom we want to capture and another person talking whom we don't want to
capture, after less than a second our directivity pattern suddenly changes. You see here we have
the desired sound source -- it's all red in the directivity pattern -- and we have somebody who is
talking whom we don't want.
And the computer actually says, okay, there is some sound source we want to suppress, and
applies it. This is a typical application of convergent adaptive beamforming, which places a null
towards the undesired direction.
So there's a null towards that direction and a beam towards the desired direction. We have four
microphones, and that means we can satisfy four conditions. The first is that towards the
listening direction we want unit gain and zero phase shift. So I have three other conditions to
play with. If we have a second source, we can have another null. If we have a lot of sound
sources, we'll need a lot of microphones. This explains why we need 512 microphones for that
basketball court.
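Here is the counting argument written out for a single frequency bin: four microphones give four complex weights, so we can impose unit gain toward the look direction and nulls toward up to three interferers by solving a small linear system. The linear geometry, frequency, and angles are illustration values only, not the product's.

```python
import numpy as np

def steering_vector(angle_deg, mic_x, freq, c=343.0):
    tau = mic_x * np.sin(np.radians(angle_deg)) / c
    return np.exp(-2j * np.pi * freq * tau)          # plane-wave phases per mic

mic_x = np.array([-0.113, -0.036, 0.036, 0.113])      # assumed 4-element array (m)
freq = 1000.0
look, interferers = 0.0, [35.0, -50.0, 70.0]

# One row per condition: gain 1 toward the look direction, 0 toward each interferer
D = np.stack([steering_vector(a, mic_x, freq) for a in [look] + interferers])
g = np.array([1.0, 0.0, 0.0, 0.0])
w = np.linalg.solve(D, g)                             # the four weights for this bin

print(np.round(np.abs(D @ w), 6))                     # ~[1, 0, 0, 0]
```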
Okay. And one more thing -- I'll do a quick demo. We can go a little bit further: we can do
localization of the sound for each frequency bin and then do something additional, beyond just
the linear beamformer, which was simply summing the signals from the microphones weighted
in a certain way. What we can do is apply an extra weight per bin. This is kind of a spatial
suppression. And here is a simple illustration.
[audio playing]
>> Ivan Tashev: We have a human speaker here and an array tucked in here. This is a radio.
And of course a lot of noise.
So just for the sake of the experiment, what I did is: everything within plus/minus 20 degrees
goes through; everything beyond that gets a gain of zero. We don't need those signals. And this
is what we have at the output.
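The plus/minus 20 degree experiment amounts to a hard per-bin gate like the sketch below; the per-bin direction estimates are placeholders for whatever per-bin localizer actually feeds this step.

```python
import numpy as np

def spatial_gate(Y, per_bin_doa_deg, target_deg, width_deg=20.0):
    """Y: beamformer output spectrum (n_bins,); zero bins arriving outside the sector."""
    keep = np.abs(per_bin_doa_deg - target_deg) <= width_deg
    return Y * keep                         # gain 1 inside target +/- width, 0 outside

# Usage with made-up numbers: 257 bins, target talker at 0 degrees
rng = np.random.default_rng(0)
Y = rng.standard_normal(257) + 1j * rng.standard_normal(257)
per_bin_doa = rng.uniform(-90, 90, 257)     # placeholder per-bin direction estimates
Y_gated = spatial_gate(Y, per_bin_doa, target_deg=0.0)
```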
[audio playing]
>> Ivan Tashev: So the radio is gone completely. We have two sound signals in the same
frequency band, and the only difference is that they came from different directions. Using the
microphone array, we can separate them and suppress the one we don't want. Her voice sounded
slightly distorted, and that's because, I want to repeat, the filter was just this crude. We have
way more sophisticated statistical methods to introduce as little distortion as possible and still
suppress whatever we don't want.
So this is pretty much what we just heard, as a function of time, and this is direction. This is
the radio here, and this is her voice. You see that we have enough resolution to let this one go
through and to suppress this one. In terms of directivity index, the blue line is just the
beamformer, and you can see that the spatial filtering actually gave us around 15 to 20 dB better
directivity index.
It's time for a 15-minute break. Walk around. And, by the way, I will switch to the Kinect
device, so we can use the break to play some games -- I brought an actual Xbox with real games.
In the second part we will continue to see how this thing works. Enjoy the device in the
meantime, do some socializing, and after 15 minutes we'll get together again. There will be a
raffle to see who is going to get the door prizes here. In short, a 15-minute break, people.
>> Bob Moses: Okay. If we can get people to come back and sit down, we're going to do our
raffle now. Must be present to win. We have four Microsoft pens. That will be the first four
winning prizes.
Oh, Brian Willbury. Willoughby. Sorry.
[applause]
>> Bob Moses: Okay. Where's Rick? Do you want to write these down, Rick, who wins what?
Greg Mazur [phonetic], come on down. Also known as the cookie lady. Greg is responsible for
your treats. Thank you, Greg.
Rick [inaudible]. Travis Abel. Congratulations.
Okay. Next we have a one-gigabyte flash drive thing. Scott Mirins.
Next we have a copy of Windows 7 Home Edition. Ken Kalin. Did I pronounce -- sorry for all
the names I'm butchering, by the way.
Windows 7 Ultimate. Gary Beebe.
And the grand prize, this is a copy of Ivan Tashev's Sound Capture and Processing book, and if
you're nice he might even sign it for you. Steve Wilkins.
>> Ivan Tashev: Okay. We'll start to move quickly. So far we have been dealing with sounds we
don't know anything about, those other noises. We can't do anything with them except keep
them outside of our sound capture.
There is also a group of sounds which we do know: what we send to the loudspeakers. When
you watch a movie, there is quite a loud sound coming from the loudspeakers; when you talk
with somebody, the voice from the other side comes from the loudspeakers. It would be nice to
be able to remove those sounds, because we want to capture only the local sound. This process
is called acoustic echo cancellation.
This is one of the earliest and most frequently used signal processing algorithms -- it is part of
every single speakerphone and every single mobile phone. The scenario is very simple. We have
two rooms, each with a microphone and a loudspeaker. This one is called the near end room, our
room; the other is called the far end room.
If you don't do anything about this, what happens is: this guy speaks, it is captured, transmitted,
played by this loudspeaker, captured by this microphone, played back in the near end room, and
captured by this microphone again. Every word ends up with a kind of echo. In the worst case
we can even have feedback, which frequently happens when you place a microphone close to the
speakers.
So if we can somehow remove from here the sound we sent to the loudspeaker, that will be
nice, because on the output of this acoustic echo cancellation we'll have just the local sound, the
near end sound. In most systems this happens in two stages. The first is known as the acoustic
echo canceler. That's one of the first applications of the so-called adaptive filters.
Roughly what happens is that we send this signal to the loudspeaker, and it reaches the
microphone convolved with a function called the room impulse response -- not the same for all
frequencies, and for some frequencies barely at all -- but in general, after this filtering, the signal
ends up here. And of course there is the local speech and the local noise. If we can somehow
estimate this filter, we can filter the signal we sent to the loudspeakers and simply subtract it.
Theoretically we will end up with just the local speech and the noise, because if our estimate
here is correct, we'll be able to subtract the entire loudspeaker portion of the signal.
This does not happen exactly, of course. This filter is adaptive: we carefully watch the output
and tweak the coefficients of the filter on the fly so we can minimize the output here. What is
left from the loudspeaker signal is the so-called residual. A typical well-designed acoustic echo
canceler removes 15 to 20 dB of the loudspeaker signal. So it's 20 dB down, but still audible.
What more can we do?
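A classic normalized-LMS canceler is a reasonable sketch of the adaptive filter just described: estimate the loudspeaker-to-microphone path and subtract its predicted echo. Filter length, step size, and the simulated room below are illustration values, not the product's.

```python
import numpy as np

def nlms_aec(far_end, mic, n_taps=256, mu=0.5, eps=1e-6):
    h = np.zeros(n_taps)                    # adaptive estimate of the echo path
    buf = np.zeros(n_taps)                  # most recent far-end samples
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        e = mic[n] - h @ buf                # error = local sound + residual echo
        h += mu * e * buf / (buf @ buf + eps)
        out[n] = e
    return out

# Simulated usage: mic = far-end through a short decaying "room", plus local noise
rng = np.random.default_rng(1)
far_end = rng.standard_normal(16000)
room = 0.1 * rng.standard_normal(64) * np.exp(-np.arange(64) / 8.0)
mic = np.convolve(far_end, room)[:16000] + 0.05 * rng.standard_normal(16000)
cleaned = nlms_aec(far_end, mic)            # echo largely removed after convergence
```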
The next reasonable step: if you guys remember, we had the linear beamformer and then that
suppressor thing where we applied gains between 0 and 1. The same idea follows here. We
already did what we could while treating the signal as a complex number with phases and
magnitudes. What's left is to estimate the energy and suppress it directly: estimate a gain, again
between 0 and 1, for each frequency bin, and try to suppress what's left of this energy in the
same way as the noise suppressor did, except that instead of having a constant noise model we
estimate on the fly how much of this energy is still left here.
This is called the acoustic echo suppressor. The combination of an acoustic echo canceler and
an acoustic echo suppressor is usually named an acoustic echo reduction system.
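The suppressor stage reduces to a real gain between 0 and 1 per frequency bin, something like the sketch below; the residual-energy estimate is a placeholder for the on-the-fly estimator being described.

```python
import numpy as np

def echo_suppress(spec_after_aec, residual_energy_est, floor=0.1):
    """spec_after_aec: complex spectrum out of the canceler;
    residual_energy_est: per-bin estimate of leftover loudspeaker energy."""
    energy = np.abs(spec_after_aec) ** 2
    gain = np.clip(1.0 - residual_energy_est / (energy + 1e-12), floor, 1.0)
    return gain * spec_after_aec            # attenuate only, never boost

# Usage with made-up spectra: pretend 30 percent of each bin's energy is residual echo
rng = np.random.default_rng(2)
spec = rng.standard_normal(257) + 1j * rng.standard_normal(257)
out = echo_suppress(spec, 0.3 * np.abs(spec) ** 2)
```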
So far we had one loudspeaker and one microphone. That's a perfect speakerphone, and this is
pretty much what happens in every single speakerphone. In addition, we have one more
[inaudible] here, which is the stationary noise suppressor, the first signal processing algorithm
we discussed. And, voila, you have a pretty nice and decent speakerphone.
Okay. That's good. So now we can do speakerphones, and the question is: can we do stereo?
The first idea that comes up is, okay, we'll chain two acoustic echo cancelers, one will get the
left channel, the other will get the right channel, we'll subtract whatever they estimate, and we'll
have our echo suppressed.
This does not work, and the reason is that those two signals are highly correlated. On the left
and right loudspeakers, if I want to play a sound source right in the middle, I send the same
signal to both. If the sound source is more to the left, it is mostly on the left loudspeaker
channel, but an attenuated and delayed version still comes out of the right channel.
And this means that the acoustic echo cancelers, if you do it this way, will be chasing those
phantom sound sources, not the actual signals from the speakers. Because of this correlation,
we -- okay, let me put it this way.
Those adaptive filters try to solve a mathematical problem where we have two filters to estimate
and just one equation. That means we have an infinite number of solutions, and the chance that
our two adaptive filters converge to the right one -- the one that stays valid when the proportion
between the left and the right channel changes -- is very small. What happens in reality is that
the adaptive filters chase those phantom sound sources, constantly trying to find a better
solution, and as a result we get very bad echo suppression. It goes down only 3, 4, 5 dB,
somewhere there.
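One compact way to write down the problem being described (the notation is introduced here, not taken from the talk):

```latex
% Echo at the single microphone from two loudspeaker channels:
\[
  y(n) = (h_L \ast x_L)(n) + (h_R \ast x_R)(n) + s(n).
\]
% If the channels are strongly correlated, e.g. x_R = g \ast x_L, then any pair of
% estimates with \hat{h}_L + \hat{h}_R \ast g = h_L + h_R \ast g cancels the echo
% equally well: one equation, two unknown filters, infinitely many solutions -- and
% the particular one the filters converge to breaks as soon as g changes.
```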
So the potential approaches, written about all over the specialized papers, ask: can we
de-correlate the two loudspeaker channels? One of the guys actually suggested, really suggested
in a scientific paper, what if we introduce 10 percent nonlinear distortion on the left channel.
This is sufficient to de-correlate the channels, and, voila, those two adaptive filters converge
properly.
10 percent THD is actually painful to listen to. So in this direction, people continued looking:
okay, what if we use psychoacoustics and introduce inaudible distortions, distortions which
humans cannot hear, and still keep those two channels de-correlated enough for the two filters to
converge. There are potential solutions in that area. There are other papers which say, okay,
since those filters cannot converge, let's rely on just the acoustic echo suppression -- but that is
costly because it introduces nonlinear distortions.
What we do in devices such as Kinect is try to live with [inaudible]. We have a microphone, we
have one adaptive filter, and that's it -- we cannot have more than one. Then the question is: can
we somehow learn the way the signals are mixed at the microphone, mix the signals we send to
the speakers in the same way, and use just a single adaptive filter to compensate for people
moving around and things changing? Because my loudspeakers are there and the microphone is
here, but when I move, the sound reflects and bounces off me, and this slightly changes the
filters I have to use. Hopefully this single filter can compensate for that.
Initially what we did was use a chirp signal -- pretty much a linear frequency sweep from 200 to
7,200 hertz -- and just estimate those filters during the first calibration phase. That was working
perfectly, up to the moment my colleagues on the Xbox team said: this sounds ugly, this is not
for a consumer product. So this is why today, when you set up your Xbox, you'll hear this sound
twice.
[audio playing]
>> Ivan Tashev: So this music was carefully selected to have every single frequency between
200 and 7,200 hertz present. It sounds way better than the ugly linear chirp, and it plays during
the installation of your Xbox and Kinect. We estimate and store this mixing matrix, and then we
use one single adaptive filter in real time when you watch a movie and there is sound and
shooting and you want this Xbox [inaudible]. This is what happens: on each of the four
microphones we get rid of most of the echo here.
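As a toy sketch of that arrangement: combine the loudspeaker channels with the mixing learned during calibration, then run one adaptive filter (the nlms_aec sketched earlier) against that single reference. The scalar weights below are placeholders; the talk describes a stored mixing matrix, not two scalars, so treat this purely as the shape of the idea.

```python
import numpy as np

def stereo_aec_single_filter(x_left, x_right, mic, mix_weights):
    """mix_weights: per-channel mixing learned during calibration (placeholder scalars)."""
    reference = mix_weights[0] * x_left + mix_weights[1] * x_right
    return nlms_aec(reference, mic)         # one adaptive filter tracks the remainder

# Usage with a made-up calibration result
mix_weights = np.array([0.6, 0.4])          # stands in for the learned mixing
```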
So this is one of the things we just brought you. I want to underline how proud I am of this
algorithm. The inventor of acoustic echo cancellation himself, from Bell Labs, wrote a paper in
1991 stating that stereo acoustic echo cancellation is not possible. In 2007 we demonstrated a
working solution during our Microsoft Research TechFest, and in 2010 it's part of a [inaudible]
product.
So it's a little bit more complex than it looks on those simple diagrams, but I will use this just to
make a simple demo here. So this is what one of the microphones capture.
[audio playing]
>> Ivan Tashev: And this is what we have on the output.
[audio playing]
>> Ivan Tashev: So as you can hear, we can get rid of most of the sound from the loudspeakers.
I want to underline again how challenging this is.
When you use this algorithm in your conference room, you usually set the loudspeakers to the
level of a human voice -- we speak at around 60 dB SPL, and that is about where you will set the
loudspeaker level.
When you sit in your gaming room and start to play a game or watch a movie, the sound level is
usually around 75, 80, 85 dB SPL, and we measured crazy people who listen at 90 dB SPL. 90
dB SPL is 30 dB above your voice -- roughly a thousand times more energy coming through the
speakers than from your voice -- and you have to get rid of all of that and still keep your voice at
least 20, 25 dB above the noise floor that's left, so you can get distant communication and
speech recognition.
Okay. What can we do next? Again, this talk is not just about Kinect; we are talking about a lot
of algorithms related to sound capture. One of the classic problems floating around is so-called
sound source separation.
We have an audio channel with two voices mixed. Can we separate them? Yes, we can do this
even with a single channel. A microphone array with more than one channel gives us additional
cues, because at least those two voices come from different directions.
And we have pretty much two separate groups of algorithms. One of them is so-called blind
source separation. Mathematically speaking, the question is: we have this signal, can we split it
in two in a way that maximizes the statistical independence of the results -- presumably voice 1
and voice 2.
This just uses the fact that the two voices are statistically uncorrelated. The second group uses
the fact that those two people are usually in two different directions, so we can use the
microphone array to say: by the way, one of them is here and one of them is there.
Because the two approaches exploit different properties, their effects can in principle add up --
which rarely happens in real signal processing. Usually you have one algorithm which gives 10
dB of separation and another which gives 10 dB, you chain them, and instead of 20 you get 12.
So the good things do not always sum, but in this particular case, because we combine
orthogonal features, we can expect it to help. And it actually does. We combined those two
algorithms just to see how well we can separate the sources. This axis is distance in meters, and
this is the angle between the sound sources.
Roughly, think of a four-seat couch, 1.5 meters wide: this is person 1, two persons in between,
and this is the leftmost and the rightmost person. Move the couch further away and the distance
between them stays the same, but because it is further, the angle shrinks. So the same couch at
four meters gives just 25, 26 degrees between the two outer persons on the couch.
And this is pretty much how many dB of separation you get between those two sound sources.
We can suppress the other guy by 20 dB, which is already quite good, and that holds up to about
2.5, 2.7 meters.
>>: Can you define SIR and PESQ, please.
>> Ivan Tashev: SIR is signal-to-interference ratio. If the two sources have equal loudness, it
tells you how much you can suppress one relative to the other.
PESQ is something most audio people are not familiar with. It stands for Perceptual Evaluation
of Speech Quality. This is a standard which comes from telecommunications. There is an
ITU-T -- International Telecommunication Union -- Recommendation P.800, which describes
MOS testing: how we can evaluate how good the sound quality is, how humans perceive it.
And it's as simple as this. You get 100 people, ask them to listen to a set of recordings and rate
them from 1 to 5, and then you average. That is the mean opinion score: a number between 1
and 5, where 1 is very bad quality, I cannot understand it, and 5 is crystal-clear quality.
This is time-consuming and not something you are willing to do constantly. PESQ is a signal
processing algorithm which serves as a proxy for this type of measurement. You take the clean
speech signal and the degraded signal, run them through this algorithm -- which is an ITU-T
standard -- and you get a number from 1 to 5 which is very highly correlated with what you get
from real people.
And it runs in about a tenth of the duration of your recording. So increasing PESQ means you
get better perceptual quality. A 0.1 improvement is audible if you are in the business -- not
necessarily an audio expert, but someone who can listen well. With a 0.2 improvement in PESQ,
my grandmother should be able to tell the difference. And here we are talking about 0.4, 0.5, so
the improvement in perceptual quality is well audible.
Again, heavy processing improves the quality of the signal you want, but it also introduces
distortions which reduce the quality. It's always a question of balance. This is why, when we
design signal processing algorithms, especially for [inaudible] devices, we carefully watch this
number. It is the feedback telling us whether we overdid it. If you overdo it, yes, the nasty signal
you want to suppress is gone, but the one you want to keep already sounds very bad.
So PESQ is actually one of the measures of how well we do. Usually you tighten the bolts of the
algorithm and it goes up, up, up, and at some point you overdo it and PESQ starts to go down.
That's the moment you have to stop and back off a little bit.
>>: So is there any intelligibility measure as a standard?
>> Ivan Tashev: There are intelligibility measurements. They are mostly done with humans, and
pretty much --
>>: There's nothing --
>> Ivan Tashev: As far as intelligibility is concerned, the only thing we can do is decrease it.
Pretty much you cannot beat the human brain.
>>: No, no, no, I mean is there an algorithm similar to PESQ?
>> Ivan Tashev: Oh, automated. Nope.
So let's skip this and listen to a demo. We have two persons shoulder to shoulder, 2.4 meters
away, speaking -- one of them in Chinese, the other in Korean.
[audio playing]
>> Ivan Tashev: And then we do our magic using the power of the microphone array and
independent component analysis.
[audio playing]
>> Ivan Tashev: So this is not picking one voice out of 500, but still, at 2.4 meters, with all the
reverberation and noise in that room, we are able to separate them. The quality might not be
great, but it is sufficient for speech recognition and sufficient for telecommunication.
Technically this means I can sit on the couch with my wife and we can watch two channels
simultaneously -- most probably she's on the large screen watching the soaps and I'm on the
small screen with the latest football game -- and we can each speak and send voice commands,
and they'll be executed separately for each of the screens.
So this is one of the potential applications of this algorithm.
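For the "statistical independence" half of this, here is a rough sketch using FastICA from scikit-learn on a toy instantaneous two-source mixture. Real speech in a reverberant room needs convolutive or frequency-domain variants plus the array's spatial cues; this only illustrates the core idea, with made-up stand-in sources.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 16000
s1 = np.sign(np.sin(2 * np.pi * 5 * np.arange(n) / n))    # stand-in "voice" 1
s2 = rng.laplace(size=n)                                   # stand-in "voice" 2
S = np.stack([s1, s2], axis=1)

A = np.array([[1.0, 0.6],                                  # unknown mixing: two talkers
              [0.4, 1.0]])                                 # observed at two microphones
X = S @ A.T

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)               # recovered sources, up to scale and order
```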
>>: Ivan, how many dB of separation do you get?
>> Ivan Tashev: What we just listened to was around 21, 22 dB. If we turn back to the chart,
we're talking 2.4 meters, 26 degrees -- yes, around 21, 22 dB. If you go closer, you're at 28.
And here is another interesting thing: if I add this line here, the blue is what had already been
published in this area, and people were claiming, wow, we did 1 dB. The first time we tried to
publish that we had achieved 20 dB in the same conditions, our paper was rejected -- this was
"not possible."
So, again, we do a lot of complicated signal processing. Why? What should we do with it?
What are the applications?
Here is a typical application for cars. You guys have heard about Ford SYNC, Kia UVO, and the
Nissan LEAF. All three cars share the same property: their in-car entertainment system runs on
the Microsoft Auto Platform, which, on top of everything else, contains an audio stack for
telecommunication, and they do speech recognition.
Telecommunication, okay -- this converts the car audio system into a gigantic [inaudible]
headset. For the driver, who listens through the nice and cool audio system with the good
speakers, that's a pleasure. Unfortunately, the noise in the car is not low, and you have to do
acoustic echo cancellation and noise suppression and then encoding and decoding to get onto
the GSM or CDMA telephone line.
On the other hand, people are not happy, and I can play you what they hear. So this is what we
have in the car.
[audio playing]
>> Ivan Tashev: And, on the other hand, what we have is...
[audio playing]
>> Ivan Tashev: So I selected a segment where the noise slowly increases -- this is the car
accelerating on the highway. You notice those breaks in the audio. This is not frame loss
[inaudible] during the transfer of the audio; this is just the encoder refusing to work correctly at
such a bad signal-to-noise ratio. The standard GSM encoder is not adapted or designed to work
at so low a signal-to-noise ratio.
The second thing we notice is that all those processing blocks are optimal in different senses.
Some of them are minimum mean square error, some log minimum mean square error; PESQ is
for codecs. So the idea is: okay, what if we go and optimize end to end? We have a substantial
amount of recordings, and we process them through all of those blocks, including the encoder
and decoder, simulating the entire telephone line.
And at this point we measure PESQ. Then the question is: can we tweak the nuts and bolts of
the algorithms, tune them to maximize PESQ at the other end of the telephone line -- to
maximize the user experience end to end?
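In pseudocode terms that end-to-end tuning is just a search over the knobs of the chain, scored after the simulated telephone line. Everything named here -- process_chain, codec, quality_score, and the two example knobs -- is a hypothetical stand-in for the real AEC/beamformer/suppressor, the GSM codec, and PESQ.

```python
import itertools

def tune_end_to_end(recordings, process_chain, codec, quality_score):
    """recordings: list of (clean_reference, in_car_capture) pairs."""
    best_params, best_score = None, float("-inf")
    suppression_db = [6, 10, 14, 18]        # example knob: noise suppression depth
    smoothing = [0.5, 0.7, 0.9]             # example knob: gain smoothing factor
    for params in itertools.product(suppression_db, smoothing):
        scores = []
        for clean, noisy in recordings:
            processed = process_chain(noisy, *params)
            received = codec(processed)     # simulate encode/decode over the line
            scores.append(quality_score(clean, received))
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score
```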
And of course, because the noise is too high, what we do is add a microphone array -- two,
maybe four elements -- do the acoustic echo cancellation first on each of the channels, then our
beamformer and the noise suppressor, and then optimize this entire chain. That brings us to
something like this. If I can find my cursor. Okay.
[audio playing]
>> Ivan Tashev: So the first thing you notice is that you can still hear the sound of the car, and
that's completely fine. For us, the voice of the speaker has the maximum perceived quality.
>>: But at that level it should be easy to suppress [inaudible], just like in your earlier case, two
speakers --
>> Ivan Tashev: Every suppression has a price. You squeeze out 14 dB of noise suppression and
that's already on the border. You go to 20, and you start to have badly --
>>: In the earlier case the distortion was much less.
>> Ivan Tashev: This is the same audio signal. Those 6, 7, 8, 12 dB we gained here actually
helped the encoder to work properly, and the optimization process pretty much stopped at the
moment the encoder started to work properly, without overdoing it. This is the key idea: you
optimize end to end and get the maximum in terms of perceived sound quality.
So what can we do, and why do we have to do this in the car? Because when you drive, your
hands are busy and your eyes are busy. What's left is ears and mouth, so speech is a good
medium for communicating with the computer system of the car. And we are not talking about
the computer that does the ignition or controls the gearbox; we're talking about the computer
that handles your music, your telephone -- the stuff related to the information and entertainment
of the user.
So speech is a good modality, a good component of the multimodal human-computer interface,
which also contains buttons and a graphic screen that is still okay to glance at. In general, when
you design such systems, the question you always have to answer is: do I want the drivers I
share the road with to be doing this in their cars?
So without going much further, I'll show you a short movie segment of what we have designed
here in MSR.
>> Video playing: [inaudible] principal architect in Speech Technology Group in Microsoft
research. We're in a driving simulator lab. And I'm going to demonstrate CommuteUX.
CommuteUX is a next-generation in-car dialogue system which allows you to say anything
anytime. This is a dialogue system which employs natural language input. And the system still
will be able to understand you.
Let's see how it works. At the beginning of our trip, let's start with listening to some music. Play
track Life of the Party. So this was the classic way to state the speech query. But we don't have
to specify track or artist. Play Paul Simon. So as we can see, the system didn't need the
clarification about the artist. Play the Submarine by the Beatles. And it can even understand
incorrectly specified names.
Let's make some phone calls. My Bluetooth telephone is paired to the system, and the computer
knows my address book. So I can ask directly by name. Call Jason Whitehorn. "Calling Jason
A. Whitehorn; say yes or no." Yes. "Jason A. Whitehorn; do you want cell, work, or home?"
Cell. "Calling Jason A. Whitehorn at cell."
So this was a classic and slightly painful way to interact with the Bluetooth phone. But our
system can understand and doesn't need disambiguation if the name is unique. For example, call
Constantine at home. "Calling Constantine Evangelo Lignos at home."
But there is even more. Frequently the driver needs to respond to some urgent text messages,
and it's definitely not safe while you're driving. So we can use our speech input to do the same.
"Message from Juan: ETA? Say reply, delete, call back, or skip." Reply: In 10 minutes. "Am I
right that you want in 10 minutes, or a number for the list?" Yes. "Got it. Message sent."
Maybe you notice it, but there is some irregularity in movement of our car, and I'm afraid that we
have a flat tire. And because this is a rented car, I even don't know how to open the trunk. So
now is the time to open the owners manual and to try to figure this out. But instead of this, our
computer already did this for us. So we can just ask. How to open the trunk. And we see the
proper page here. But then we need to replace the flat tire. How to replace a flat tire? And we
go directly to the page of the owners manual which describes how to replace the flat tire.
Once we have our flat tire fixed, we can continue our trip. And listen to some music. Play Nora
Jones.
>> Ivan Tashev: So two things here. You know, in a movie you can do everything, but this was
done on the fly, pretty much in one shot, and we didn't have much training for it. I had two or
three persons from the driving simulator project with some microphones behind me, and I made
a mistake only once, when I almost crashed the car. Otherwise this is kind of a proof that you
can drive and operate the system with less distraction than usual.
In any case, responding to text messages in this way is way safer than trying to punch the keys.
Yes, it's forbidden in most of the states, and still most teenagers do it. So the question here is
not whether we want the drivers we share the road with to do this at all; it's more whether we
want them to type or to respond with voice, at least for the most urgent messages.
Anyway, this is a demonstration of one of the applications of capturing sounds -- not only for
telecommunication, but as one of the modalities for operating a computer.
And of course the next scenario: we sit in our media room trying to find the remote control, or
arguing over the remote control, and the question is, can we use our voice instead? One of my
favorite [inaudible]: the man is lying on the sofa, the remote control is at the other end of the
coffee table, and the guy lifts the coffee table so it slides into his hand [inaudible]. Now, with
devices like Kinect, you don't even have to do that.
So what do we have in Kinect? We have that multichannel acoustic echo cancellation, an
algorithm no other company has [inaudible] so far. We have pretty decent beamforming and
spatial filtering. And eventually, in some cases, we have this sound source separation algorithm.
What Kinect has in addition to this is a 3D video camera. Okay, there are actually two cameras:
one of them is a visible-light camera and the other is an infrared depth camera. And there is this
four-element microphone array.
The depth camera is another new, pretty much revolutionary thing in the area of
human-computer interaction. It works in complete darkness and allows us to do -- okay, not
very easily, but allows us to do gesture recognition. Once you stand in front of the camera, after
some processing, your body skeleton is assembled and you have the XYZ of each of the major
joints. Initially it was just the head and the two hands, but later we could do the knees and
ankles -- even if a woman wears a long skirt, we are able to find the knees and the ankles -- so
we can track the body.
Then gestures: a gesture like this means if the Z of this joint is bigger than the Z of that joint,
boom, we have a hand raise. A gesture like this: if the Euclidean distance between this joint and
that joint is smaller than a given threshold, we have a clap, et cetera, et cetera. Most of the
gestures we use are quite simple to program on top of this sophisticated skeleton tracking layer,
which is part of the new XDK platform.
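Written out, those two rules are only a few lines. The joint names, the coordinate convention, and the clap threshold below are assumptions for illustration; the skeleton tracker supplies the actual joint positions.

```python
import math

def hand_raised(skeleton, up_axis=1):
    """skeleton: dict joint name -> (x, y, z); here axis 1 (y) is taken as 'up'."""
    return skeleton["right_hand"][up_axis] > skeleton["head"][up_axis]

def is_clap(skeleton, threshold_m=0.08):
    l, r = skeleton["left_hand"], skeleton["right_hand"]
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(l, r)))   # Euclidean distance
    return dist < threshold_m

# Usage with made-up joint positions (meters, camera coordinates)
skeleton = {"head": (0.0, 1.6, 2.0),
            "left_hand": (-0.05, 1.0, 1.9),
            "right_hand": (0.02, 1.02, 1.9)}
print(hand_raised(skeleton), is_clap(skeleton))   # False True
```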
And this allows game developers to create very fancy and interesting games. We will see some
of them in the demo. But before we go there, a couple of minutes of illustration of what we can
do with the combination -- no? Okay. Let me find this.
[audio playing]
>> Ivan Tashev: So, in short, having the depth camera combined with the microphone array
opens new ways to communicate with computers. I'll quickly demonstrate the gesture
recognition in the Kinect dashboard, so we can see how we can select things, start a game, et
cetera, and create pretty sophisticated yet easy and intuitive ways for humans to communicate
with computers.
Yes, we can create very fun games. I'm not going to dance and I'm not going to play Kinect
Adventures tonight. But I think the real value of this combination of sound, gesture, and speech
recognition in games is that we can communicate, talk, and interact with computers in a way
more natural way than we are used to with keyboard and mouse.
So the last thing, if you guys haven't seen it: this is the book I published last year. If you want
to go a little deeper and play with the algorithms we discussed, there are some MATLAB
[inaudible] there and some .wav files, so you can go and try something yourself if you're
interested in this.
Before we go to questions, I want to show you some things here. You can see my image in the
depth camera. Now I have gesture control and can go -- I think the game will start, but we'll
try -- okay. One of the ways to interact is basically using gestures. The game is Dance Central,
and technically you just have to dance together with Lady Gaga and others. But the only thing I
want to show here is the way we can select and use the menus.
[audio playing]
>> Ivan Tashev: Okay. I can see the [inaudible]. Okay. And then dance select. Who's playing.
And, for example, look, I have my left hand here for the left [inaudible] moves. I can have my
right hand. No mouse, no controller, nothing. And this is where the dancing begins. So pretty
much very natural, no [inaudible].
[audio playing]
>> Ivan Tashev: Who's going to dance? Come on up. Come on. Come here. It's easy. Trust
me. Come on.
[applause]
[audio playing]
>> Ivan Tashev: Okay. Watch one of your hands. Okay. Yes. Who is one of us. Screen
control. Okay. That's -- so try like this. Resume. Just repeat it. If you see your head, you're not
doing it. Very nice.
[audio playing]
>> Ivan Tashev: Nice.
>>: [inaudible] virtual mixer one?
>> Ivan Tashev: Which one?
>>: You know, the virtual -- you'll have a virtual mixer so you can mix, you know, a 64-track
mix.
>> Ivan Tashev: It's possible.
>>: You just don't want to slip and fall down [inaudible].
>> Ivan Tashev: And just one more quick demo. For this I'll need to mute the microphones.
Okay. So you see I'm almost ready. Xbox. Kinect. I don't have to touch the controller, I don't
have to make gestures.
>>: [inaudible] microphone here.
>> Ivan Tashev: This microphone is hopefully off. What we use are the Xbox microphones over
there. So you can say anything; you just precede the command with the keyword Xbox. This is
another thing that was a difficult research problem 18 months ago: an open microphone and no
push-to-talk button. You guys remember that in the car I pushed the button -- to say whatever I
want to say, I have to tell the computer I'm talking to it. Now the microphones listen to my
voice. There is no reaction, but then I can say: Xbox. Kinect ID. Is the microphone active?
Xbox. Kinect ID.
So pretty much everything you see on the screen in black and white is a command I can say, et
cetera, et cetera. It starts the identification software, where I can go -- this is a different way to
select: you pause and hold. And I want this other profile, so I go and sign in for this guy. And
this is just a quick demonstration -- I know many people like this -- of how you move and
control the avatar. That's the last thing I'm going to show you tonight.
[audio playing]
>> Ivan Tashev: So as you can see -- oh, you want me here. Okay.
[audio playing]
>> Ivan Tashev: Okay. This ends the whole story. That was a demonstration of a strange and
still not widely known way to communicate with the computer. I'm open for your questions. I
promise to stay, let's say, 20 or 30 minutes after the talk to start some of the games. You guys
can play Kinect Adventures, jump, et cetera, et cetera.
Me, I'm pretty much not into gaming. But this device breaks two barriers. The first is the
gender barrier. The image of the gamer sitting and holding the controller -- usually male, a
teenager or in his early 20s, shooting the enemies -- is gone. Dancing is going to open gaming to
the other gender. And the second is that, for God's sake, at some point we have to get up from
the couch, stand in front of the device, and start to move. And this is for every single person,
regardless of age. So it breaks the age barrier as well.
That was it. Thank you. You were a very, very, very good audience. Questions, please. Here.
>>: So what are other [inaudible] of operating besides command and control? Are there
anything --
>> Ivan Tashev: So there are games -- think about Kinect, and Xbox in general, as a kind of
operating system. It exposes certain interfaces. The game companies can use whatever they
decide to: they have access to the speech interfaces, they have access to the skeleton tracking,
and then it's up to their imagination.
It's tricky. Speech recognition is not yet at the level of the graphics and the control handling.
So of the games available for Kinect today, only a couple use voice and speech. One of them is
Kinectimals, where you can name your animal and send some commands, and the animal listens
to you. The other is actually the biggest surprise on the market: an extremely good speech
recognizer, a very rich grammar, and very good communication with the computer -- how they
did this we don't know, because it's not a Microsoft game. But that's pretty much it. We are
going to improve the underlying technologies, but what the gaming companies that create the
games do with the speech, gestures, et cetera, et cetera, is up to them.
Go ahead.
>>: [inaudible] is a customer able to upgrade their firmware when new versions come out --
>> Ivan Tashev: Yes.
>>: -- in Kinect itself?
>> Ivan Tashev: Yes. The firmware is -- pretty much the device has only a boot loader. From
the moment you connect it to your Xbox, it downloads the latest and greatest from the Internet
and pushes it to the device.
>>: Using -- if you have the camera, or the microphone array, to know the location of a person,
and I turn my head, how accurately can you tell the angle I'm speaking at relative to the
microphone with that system? What I was imagining was Dune -- the science fiction book --
where you yell the direction you want to send your blast. In order to make an attack, could I
have an opponent and yell in that direction and shoot something?
>> Ivan Tashev: The gestures can do this. We do not have a technology in the XDK which
allows you to guess the head orientation. I'm not saying it's not doable, and I'm not saying we
don't have the technology. It's just not available in the existing --
>>: [inaudible] with the microphones, though.
>> Ivan Tashev: With the microphones, what we can do is estimate the direction, meaning the
angle in this plane. So it can tell you where you are, but it cannot tell you what the orientation
of your head is. You can partially do this with the depth camera, but that's not part of the
interfaces the XDK exposes.
Can the game companies do this? Yes, they have access to the raw data. That actually gives
them an advantage, because they have something which [inaudible]. It's doable; it's just not in
the existing XDK.
>>: I have two questions. The thing you did at the end there where you were stepping from
green square to green square and holding poses, was that some kind of calibration?
>> Ivan Tashev: So this is Kinect identification. What the program was doing was trying to
estimate my biometrics -- the sizes of my bones -- which later allows the program, when I just
stand in front of the device, to say: this is Ivan. This is not as difficult as it sounds, because in a
family you have three, four, five people, usually of different sizes.
So you can do person identification based on biometrics relatively reliably. It is not a serious
problem. It is not in the existing XDK. But recognizing by voice is another signal processing
algorithm we can use.
And, by the way, the combination of biometrics plus voice, which again are orthogonal, can
give you a very reliable identification -- pretty much beyond what the casual setting needs. You
play a game, it's a trivia game, you just sit down and the scores go to your account; for that,
maybe 97, 98 percent recognition is enough. And you can do this with voice plus biometrics, or
with biometrics without voice, so it's reliable enough. But eventually at some point we may
have [inaudible].
>>: And in the earlier example when you were driving in the car and telling the thing to play a
song, what if you said -- were having a conversation and you said I'm going to the play tonight.
Would it immediately grab on that word play and play the song Tonight?
>> Ivan Tashev: To a certain degree, yes. This is why I said this is natural language input.
What happens is that we go one degree beyond just recognizing the words: we try to understand
the meaning. Even with a perfect speech recognizer -- even today's speech recognizers are better
than humans here -- the major source of irregular or ambiguous queries is the human.
So we try to give a certain freedom. It's not as good as just having a conversation; in general we
tell our users: you have to say the command first -- play -- and then what you want to be played.
We even tell them to say play, and then artist, album, or track, but usually they get it wrong. We
hear commands like "play track B of Beatles" or something like this.
On the front end, we ask the users to stick with this structure. But on the back end we can
actually handle stuff like "play me -- what was that song about somebody's --" and usually this
will get you the Yellow Submarine. We don't advertise this, because it is basically our backup,
our [inaudible] to handle human errors.
So technically, in most cases it will handle it properly. Would it be good enough to listen to an
entire conversation and just [inaudible] extract some words? Maybe not.
>>: But in the car you have to push the button before you --
>> Ivan Tashev: Yes. It happens once, just to initiate the conversation. So presumably I'm
talking to the passenger; when I want to talk to the system, I push the push-to-talk button and
the conversation with the computer starts. We can exchange a couple of phrases, the computer
can ask for disambiguation, et cetera, but there is a conversation, the opening and closing of a
task. The task can be call somebody, play that song, et cetera, et cetera. During this you don't
have to push the button again.
But at least once you have to signal: I want to talk with you. This is not the case with the
Kinect. This is another brand-new thing, unknown so far and very difficult to achieve even in a
research environment, and now it's in a commercial product.
>>: In the car, once you've -- once the computer's completed your request, does it continue to
listen to you if you want to make another request? Or do you have to push the button again for a
new request?
>> Ivan Tashev: It's all at the task level. You have a certain task: you signal the beginning, you
talk, you finish the conversation, the task ends, and the computer goes back to sleep.
>>: Okay.
>>: Ivan, does it learn when -- in the car when you're making requests over time, does it learn
your terminology like a smartphone does?
>> Ivan Tashev: It's a question of how much your smartphone really does. I think one of the
next things we want to do in the Microsoft Auto Platform -- and actually it's valid for Xbox as
well -- is to recognize who is talking and adapt the speech recognition model towards that
particular person.
Let's say I get into the car; it identifies me by the key, because keys are different and have a
unique code, or by some biometrics, or by my voice. It recognizes me and loads my acoustical
models, which are adapted to my voice. Then the next step would be: oh, by the way, this guy
has a very funny way of expressing the tasks, but I can adapt, based on the feedback -- if he
phrases things a certain way frequently, I can start to learn it. But we're not there yet. The first
adaptation towards the person, on the acoustic level, is the first thing we can and should do
before we go to that level.
Rick?
>>: How much is the speech recognition affected by accent?
>> Ivan Tashev: Okay, it works for my accent. In the Speech Technology Group in Microsoft
Research, out of 10 or 11 people we have -- how many? -- one or two native speakers.
>>: Two.
>> Ivan Tashev: Two native speakers.
>>: Three, three, three.
>> Ivan Tashev: Three. So eight of us were born outside of the U.S., from pretty much every
continent, and it works for us. So we train the speech recognizers.
>>: How much of the processing -- is all of your audio processing done in the box? Or is some
of it done in the Xbox? Same with the video processing: is that done in the box, or in the --
>> Ivan Tashev: Both. Actually, in many cases we have several instances of the audio pipeline
running. When you do game chat -- meaning you're playing Halo, et cetera, et cetera, with
real-time communication with somebody on Xbox Live -- the games don't want the audio
processing to steal any CPU cycles, because they are busy rendering complex graphics.
For that we have a specialized [inaudible] CPU inside the device. It's not very powerful -- a
2005-era mobile-phone-style CPU [inaudible], around 200 megahertz -- but it can handle the
entire audio stack.
For speech recognition, this happens on the console. We bring the four channels down because
the audio stack for speech recognition is a little more sophisticated -- we do more processing
there to clean things up, because a speech recognizer is in general more sensitive to the
[inaudible] and the noises.
And when you do voice chat using Windows Live Messenger -- you can do this from your
living room, and I actually appreciate it very much because I sit in my media room and talk with
my opponents across the ocean -- then the same voice audio stack runs [inaudible] on the device
and [inaudible] on the console, because it's kind of pointless to send it to the device. So three
variants of the same code [inaudible] are executed either on the console or on the CPU in the
device.
The video processing, up to producing the depth image, happens inside the device. There are
three or four DSP processors which handle the video and the infrared, and what we send down
is a black-and-white image where every pixel is not brightness but distance.
All the skeleton tracking happens on the console. This is a very sophisticated, complicated
algorithm; it cannot run on the device. It could, but then the device would have to be two times
bigger, with one more fan to get rid of the heat inside.
So this is pretty much where we are.
>>: What's the maximum distance that the device can be from the console?
>> Ivan Tashev: It's a USB. You already can buy from Amazon a USB cable extender for
Kinect. So 20 feet? I don't know. Most probably. Maybe a little bit more.
>>: 50 feet?
>>: [inaudible]
>> Ivan Tashev: I haven't tried.
>>: These are full speed. That would help.
>> Ivan Tashev: It's a full-speed USB.
>>: 12 megahertz? I get them confused. 12 megahertz or the hundreds of megahertz?
>> Ivan Tashev: Hundreds of megahertz.
>>: Okay.
>> Ivan Tashev: It's USB 2.0, full speed. And any cable that can run that at a given distance
should be able to run Kinect. But I haven't tried. It comes with this cable, and this is what I use.
this cable, and this is what I use.
>>: And all power supplied by the USB?
>> Ivan Tashev: Nope.
>>: Oh, okay. I don't see it. All I see is one --
>> Ivan Tashev: So this is, you know, standard USB. It carries all the standard USB signals.
But instead of plus five volts at one amp, we have a way more powerful power supply. It
goes -- and then the rest is connected to --
>>: Oh, it breaks out of the --
>> Ivan Tashev: Yes.
>>: Okay. Gotcha.
>> Ivan Tashev: You cannot feed 3G, 3, 4G speeds plus a separate CPU plus the cameras plus
the infrared projector with one amp. Period.
>>: I'm curious of any plans of developing protocols [inaudible] talk to my own receiver or
third-party hardware.
>> Ivan Tashev: Microsoft is unlikely to release this. You can buy the Kinect device and search
the Internet: seven days after we shipped Kinect, you could already download hacker drivers for
Windows.
>>: Doesn't that sell devices? Don't you guys like that? I mean, people are going to buy it just
for that.
>> Ivan Tashev: This is cool and nice, and actually the margin on the device itself is not bad.
But if you use it with Xbox, we get royalties from the games as well. Otherwise, it's completely
okay -- you guys understand that Microsoft Research has very good relationships with most of
the professors in academia. Most of them are dying to get hold of the device, plug it into a
Windows machine, and start experimenting with human-computer interaction interfaces. But
from our business team's point of view: okay, a bunch of researchers want to do this and that's
pretty much it -- how many are they, how much will it cost us to ship the drivers -- naaa.
So we may even -- MSR may eventually release some drivers not as a product but as a free
download relatively soon. There are plans in discussion for this. But no promise. We don't
know what is going to happen here.
It would be nice to go and to play on your Windows machine. And most of the algorithms
actually are designed in Windows for both video and audio. We have this connected to our
Windows machines, but this is not something we can share for now.
>>: Eventually probably.
>> Ivan Tashev: Sooner or later.
>>: [inaudible].
>> Ivan Tashev: I thought the hackers would need at least two or three months to release those
drivers. They were already available on the seventh day after we shipped Kinect.
>>: [inaudible] amazing.
>> Ivan Tashev: Yes.
>>: [inaudible].
>> Ivan Tashev: Which guys?
>>: I don't know, the guys who came up with the drivers in seven days.
>> Ivan Tashev: They're not Microsoft employees. More questions or it's play time? Thank you
very much. Thank you.
[applause]