>>: We'll get started with two talks on music and audio processing. So first we have
Eric Battenberg from UC Berkeley.
>> Eric Battenberg: All right. My name's Eric Battenberg, and I'm with the Center for New Music and Audio Technologies and the Par Lab at UC Berkeley. We call this place CNMAT for short. That's that cool symmetric logo up there next to our bridge.
And this talk is entitled "The Breadth of Applications for Music." I'm not going to be focusing on one particular application but on a few different things we're working on at CNMAT.
So there are hundreds -- literally hundreds of applications for music and also plug-ins.
And plug-ins are things you can -- little pieces of code you can use to extend existing
music applications.
So these are the four main areas we're working on at CNMAT and at the Par Lab for
music applications. The first is performance and composition, and put some cool logos to
grab people's attention here. First Guitar Hero, that's one way to interact with the
computer musically. Also the Max/MSP visual programming environment is something
that musicians have been using a lot. And Pro Tools is the most popular audio editing
software, and that's just a way for amateur musicians to edit their audio at home.
Second, music information retrieval, and that's a lot like content-based image retrieval. It
hasn't seen as much academic attention, but there are companies popping up doing a bit
of this, and then there's also this IRCAM Institute in France that does a lot of research in
that area.
Third, hearing augmentation for music. And this is just for hearing-impaired people to
enjoy music more. The types of processing that you need to do to enhance how music
sounds for the hearing impaired is very different from what you need to do to make
speech more intelligible. So we're working on that.
And this is a little subjective space that we train with a neural network that enables the
hearing aid wearer to kind of navigate around and decide on a fitting that's best for them,
a set of parameters that process the audio according to their specific hearing loss.
And last is three-dimensional sound. This is large arrays of speakers and microphones
for reproducing three-dimensional sound. We're working with Meyer Sound. And over
here we have an example of a 120-channel spherical speaker array. It's about this big and
it's used to recreate really cool three-dimensional sounds, real-sounding instruments
because you get really interesting radiation patterns in real life rather than just a single
speaker pointing in a certain direction.
So in this talk: a little bit of background on music applications, and then some insights into music and parallel computing -- the requirements that music applications have of parallel computing in order to make it work.
And then I'm going to talk about a case study, parallelizing drum track extraction, which is a particular type of music information retrieval source separation, using OpenMP and CUDA. This is targeting NVIDIA GPUs and multicore processors in general.
But before that, I'm going to talk a little bit about how to communicate the computational
needs of this project here using parallel design patterns. And this is just a high-level way
to kind of understand what the needs are, what things are being done in my application.
And last a fun brainstorm, something cool, the future of music information retrieval. And
hopefully it will be a fun, motivating example.
So for music composition we're working with different musical interfaces, electronic
interfaces. These two are developed at CNMAT. This is a touch-sensitive drum with
conductive fabric. You can do interesting signal processing on that. We have this 24-pad
Multi-Touch Array that is pressure and location sensitive. And David Wessel performs
some interesting pieces with this.
And last, we're not working directly on this, but we're somewhat involved. This is the
Reactable -- yeah, Reactable -- it's a cool table where you place these little tangibles, they
call them, little things on here and make really cool music. I suggest you Google that,
because it's a really cool example and it's really easy for novices to play around with and
make really cool sounds.
For audio editing and audio recording, there are tons of amateur musicians nowadays
making affordable home studios just using a personal computer, a sound card, and audio
editing software like Pro Tools. And they call this a digital audio workstation. And you
can get really professional quality sound out of these just using all these audio effects if
you have a good microphone. I guess I should have put a microphone on there too. That
would be a part of the cost.
So the power of these digital audio workstations lies in plug-ins. These are just little
things you can buy that incorporate different effects, compressors to get the levels
balanced. Different cool effects like Auto-Tune. Is anyone familiar with Auto-Tune?
This is not auto tuning like a kernel, but Auto-Tune that T-Pain uses to make him sound
like a good singer when he's not.
That corrects your pitch. It's automatic pitch correction. So different plug-ins like that.
And these -- they're developed by third parties or they're included with the audio editing
software. And if we're going to parallelize these, we need to ask are they thread safe; that
is, are we going to get the same behavior out of them regardless of how they're scheduled
on the machine, or are we going to get a crash, are we going to get completely different
sounds out of these things if they're scheduled differently and there's different amounts of
delays.
When they're composed, are they going to cause catastrophic performance conflicts; that
is, is the performance just going to become unbearably slow when these are competing
for resources within the application.
And last, will they appropriately share hardware resources with other applications
running on the same machine. Because realtime performance is so important for audio
applications, particularly live performances, sharing hardware resources is really
important.
So to answer these questions, the Par Lab has been working on two little projects -- not little projects, two big projects. The Tessellation OS is how they're envisioning space/time partitioning amongst applications.
And then down here is Lithe, and that's a second level: within your application, how are you partitioning resources amongst the different libraries and system calls so that you aren't competing for cache, destroying your cache when you're timesharing a core.
And for the OS -- at the OS level, another thing that music is going to need for realtime
performance is timing and deadline guarantees. So if you're doing realtime performance
and you don't meet a deadline, a timing deadline, you're going to get a click or a pop.
Also if two things are split onto different cores, they may end up being written to the
audio buffer at different times and you'll get a completely different sound out of it. And
I'll have an example of that in a second.
So the next point I want to make is that music is inherently very parallel. This is just a
score of different instruments playing. And you can see all the voices. It's like 20, 30
something voices all going in a very data parallel way.
But the important thing here is that synchronization, audio synchronization and timing
are very important. If these things aren't occurring right in time, the ear is going to be
upset.
So timing -- here's an example. It's going to be an audio example. Hopefully you'll be able to hear it. If we have a copy of a piece of audio and we process another copy of it on a separate core and we miss a deadline -- that is, the audio buffer needs to write to the digital-to-analog converter at a certain time and the other core doesn't meet that deadline -- it's going to be writing a frame late. Right?
So if we're just delaying another copy of the audio by 1 millisecond, we get this combing
effect. Anyone familiar with digital signal processing to see that this is like a comb filter,
we get these notches in frequency.
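For reference, the comb filter he's describing comes from mixing a signal with a copy of itself delayed by \(\tau\) seconds, which has magnitude response
\[
|H(f)| = \bigl|1 + e^{-j 2\pi f \tau}\bigr| = 2\,\bigl|\cos(\pi f \tau)\bigr|,
\]
so with \(\tau = 1\) ms the notches fall at 500 Hz, 1.5 kHz, 2.5 kHz, and so on -- the hollow sound in the example that follows.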
And just with the 1-millisecond delay, I'm going to show you an example of this audio
here [audio playing]. Noticeable. Did everyone hear the little difference there? It
sounds a lot more hollow, right? And we got this combing effect. It didn't just -- it
wasn't just distortion and clicks, it was actually like changing how the sound sounds.
And for people who are spending tons of time getting the sound just right on their
compositions, this is very troublesome to them. It's like the main reason that IRCAM sort
of abandoned parallel audio processing back in the day, because they couldn't get all this
timing right. Things would just not sound deterministically correct.
So one way around this is Open Sound Control. This is a standard developed at CNMAT
and IRCAM. And it's a way to kind of communicate audio performance data, sort of like
MIDI back in -- well, they use MIDI a little bit. But it operates over Ethernet.
And the important thing for us here is that when you're sending audio performance data, you can include these high-resolution time tags. And these time tags can be used to synchronize: they say when each audio event occurred, so when the data arrives at the audio buffer you know whether it has been delayed and how much you need to delay everything else to line things up. All right?
So next area, which is my area, I'm working on music information retrieval, and we call
that MIR. Also at MIT they have a machine listening group. We like to call it music
understanding, because we feel that's a little more all-encompassing than just information
retrieval. And that's sort of capturing the psychological aspect of a human experience
with music.
Up here in the corner, in the middle, is an audio waveform. And this is an example of
music transcription. You're producing a musical score at the top there from the audio
events. Also at the bottom, that's called a piano roll notation; it's like a
time-and-frequency or time-note based plot. That's one application of music information
retrieval, automatic transcription. We can also do source separation, which is isolating
different instruments for analysis.
Similarity, playlist creation. Playlist creation is I think the thing that most people can wrap their heads around as being useful. Anyone who listens to music might say: I'm in the mood for a certain Led Zeppelin song that doesn't rock out too hard, it's pretty easy
listening.
But I want to listen to some classic rock like that. I'll give it this Led Zeppelin song and
find me other songs like this and make a playlist for me, right, instead of just saying,
okay, I'm in the mood for this Led Zeppelin song and then I end up listening to the rest of
Led Zeppelin's album.
And that's usually how most people listen to this stuff. Unless maybe you go to Pandora
or something, right? Pandora will do that for you, but at the expense of you're connected
to their server. And also all that information that they use for comparison is made by
humans. So people are listening to the music, checking off different aspects of it. Maybe
that's not the category you want to be comparing.
So also we can classify by mood, artist. Score following is important for automated
accompaniment. So if you want to play a solo instrument and have the computer
automatically follow along where you are in the score and play some computer
instruments along with you.
Also lyric synchronization. That's for automatic karaoke.
And last, song segmentation, splitting up a song into verse, chorus, bridge, et cetera,
because they may be pretty different parts, and you can analyze them independently then.
So the hope with all this technology is that someday you can query for music like this: I
like the drummer but I can't stand the singer. Find me something in the same genre with
drumming like this but with a singer that sounds more like John Lennon or Rod Stewart
or someone you really like.
And this isn't that far-fetched. This kind of thing can be possible in the next few years.
All right. So as a case study for parallelizing music information retrieval I looked at
drum track extraction. And this is a particular example of source separation where you
take an audio waveform in and output an audio waveform hopefully containing only the
drums or only the percussion.
And this is done -- first there's spectral feature extraction. You take out a spectrogram,
and in this case we're operating on about 20 seconds of audio at a time. Then
Non-negative Matrix Factorization, that's NMF here, is used to split up the spectrogram
matrix into time and frequency components.
And you can select the number of components you want in this. Usually I use about 30
components just in popular music. There's enough things going on at once that 30
components works pretty well.
And then you extract features from the components so that you can classify the
components as either percussive or not or drums or not; and then you resynthesize the
audio corresponding to those components.
And the important thing computationally is that 80 percent of the time in a MATLAB
implementation is spent on the Non-negative Matrix Factorization. That's the
unsupervised learning routine that is iterative and it takes a lot of computation.
And so in the MATLAB implementation -- this is my optimized MATLAB implementation; some naive implementations take about 90 seconds to run on 20 seconds of audio, which is totally infeasible -- we got it down to about 23 seconds. People ask, how did you optimize MATLAB? It's MATLAB. But avoiding for loops, using single precision when you don't need double precision -- audio is very much single precision, the samples are usually 16 bits -- that's how we got it down. Also not computing things you don't need. Some people do some silly stuff in MATLAB.
Anyway, we got it down to about almost realtime, 23 seconds. But we're going to
parallelize this and try to get it down a lot using OpenMP for multicore machines and
CUDA for GPUs. Yeah.
>>: [inaudible]
>> Eric Battenberg: The SVM. Very tediously. I ran it on lots of columns of audio,
listened to the components. Drums, not drums; drums, not drums. I made a little
interface to do it really quickly. Yeah, I hand did it, about a hundred clips of audio, and
then I cross validated everything with pretty good results. The classification is 2 or 3
percent error rate actually.
Yeah. So here's some audio examples. And also you get a prize if you name the artist
and song. I'll buy you a beer at next year's apps workshop or something. And if you just
know an artist, then the person sitting next to you owes you a high five.
Let's go. Number one. [music playing] everyone knows who this is hopefully. Some
people probably know what this song is, right? Black Magic Woman.
Now, it's important to listen to the audio, to the drums in the original version. Most
people just hear Santana playing. If you're a drummer, you might have heard all these
drums here. Now, the drums aren't -- I mean, it did a pretty good job of separating out.
You don't hear the guitar at all.
The drums aren't the most -- they're not completely pleasing sounding because the way
the Non-negative Matrix Factorization works is that it has to smash out the frequency
components, all the contributions from all the guitar stuff. So that stuff can be
compensated for later. We didn't do any fancy signal processing at the end. This is just
the raw output of the Non-negative Matrix Factorization. But also this could be used for
drum transcription, other kinds of things.
Okay. Here's the next song. This will be different. [music playing] does anyone know
who this is? No one. An entire room. No? OutKast. Has anyone heard of OutKast?
They rap. They're a rap group. It's a different kind of music from Santana.
All right. [music playing] now, this is done completely automatically. So I didn't do this
by hand or anything.
[music playing] raise your hand if you know the name of this song. Raise your hand if
you know the artist. Four -- three people. All right. Led Zeppelin. They're a band too.
They're some kind of new band. They like just came out with this song. It's called Black
Dog.
[music playing] you can hear there's a little bit of distortion in this bass drum and a little bit of influence on the phase from the singer's voice and the guitars and stuff, but it does a pretty good job.
All right. Here's kind of what comes out of the Non-negative Matrix Factorization. This is an audio spectrogram here. It's kind of flipped upside down so it looks like a matrix. But it's aligned with the drum score here. This is just only drums here. So to get an idea of what comes out of this, we factorize into three components. And these are the spectral contributions: whenever something occurs in this row here, we're positively scaling the contribution of this part to the spectrum here.
And this is the time contribution of each of the components. And it's aligned with the drum track here. So you can see all the hi-hat hits are lining up with the hi-hats in the score, and the bass drum and snare drum, et cetera.
>>: I've got a question.
>> Eric Battenberg: Yeah.
>>: When somebody hits the hi-hat, depending on how hard they hit the hi-hat, doesn't
that change the timbre and the frequency?
>> Eric Battenberg: It does, yes. So this -- the reason Non-negative Matrix Factorization
works pretty well for drums is that the spectrum for a drum is fairly stationary. How hard
you hit it, it definitely changes the timbre. And if you are -- this is a very canned
example because these are -- I just synthesized this using the same drum sound each time.
So the reason you use more than the number of instruments you have as the number of
components is because sometimes things sound different, right? So if he's going to hit
the snare drum really hard, the timbre will change; that will end up being a different
component. But as long as it's classified in the end as drums, as a spectrum that sounds
like drums or a time component that sounds like drums, it will end up in the drum track at
the end. Yeah.
>>: So different hits to the snare drum will be different instruments?
>> Eric Battenberg: Yes. That's true. Yeah. Hit it harder, hit it on the rim, hit it on the
side, different sounds. Yeah.
All right. So Non-negative Matrix Factorization is an optimization problem. And the important thing here is that we're minimizing a cost function. And instead of just using a mean squared error for the reconstruction between the product of those two matrices and the original matrix, we're going to use this divergence function. It's similar to the Kullback-Leibler divergence. It's a log-based function. And this has been shown to work well for music.
And the important thing here is these are the gradient-based updates that iteratively
minimize this. And this cost function here is not convex in both W and H. So we will be
arriving at a local minimum. And some people get around this by doing the optimization
multiple times and using the best result. But it is -- you do come up with a non-global
solution. But it still works pretty well most of the time.
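For readers following along without the slide, the divergence and multiplicative updates being referred to are the standard ones for this cost (Lee and Seung); writing V for the spectrogram, W for the spectral components, and H for the time activations:
\[
D(V \,\|\, WH) = \sum_{i,j} \Bigl( V_{ij} \log \tfrac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \Bigr),
\]
\[
H \leftarrow H \otimes \frac{W^{\top}\bigl(V \oslash (WH)\bigr)}{W^{\top}\mathbf{1}}, \qquad
W \leftarrow W \otimes \frac{\bigl(V \oslash (WH)\bigr)H^{\top}}{\mathbf{1}\,H^{\top}},
\]
where \(\otimes\) and \(\oslash\) are element-wise multiplication and division. That's where the matrix multiplies, element-wise divides, and column sums discussed next come from; the exact form on the slide may differ slightly.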
Computationally, for the typical 20 seconds of audio, you're factorizing a 512 by about 3,000 matrix. And we do it into 30 sources. So each iteration can be broken down into about 400 megaflops of single-precision matrix multiplies and 3.6 megaflops of divides, which are pretty slow for those of you who -- yeah?
>>: So megaflops is the number of flops, not flops per second?
>> Eric Battenberg: Yes. I usually use lower case when I'm -- number of operations and
all upper case -- what is the convention? Is there a convention for number of flops per
second? Okay. Yeah.
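To put rough numbers on that breakdown: with V at 512 by 3,000 and 30 components, W is 512 by 30 and H is 30 by 3,000, so a single product like WH costs about 2 x 512 x 30 x 3,000, roughly 92 megaflops; a handful of such products per iteration gives the roughly 400 megaflops of matrix multiplies, and the two 512-by-3,000 element-wise divides per iteration give the few megaflops of divides. This is back-of-the-envelope arithmetic consistent with the figures quoted, not taken from the slides.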
And then, also important for parallelization, we have some sums here. It's not much work, but if you're going to parallelize this onto a lot of threads, it requires a lot of communication to do sums in this way. And we also compute a log-based cost function every 25 iterations, which is pretty slow, because logs are slow.
So how can I communicate to you what this app involves, what the important design
considerations were in this application. How can I come up with some jargon that will
help you understand what went into this and also communicate to other application
experts, like Jim Demmel, what I need out of this application.
And one thing we're working on -- this is sort of a plug for something we're working on
at UC Berkeley, is this parallel design pattern language. And we call it OPL. It's just a
working name, Our Pattern Language.
But it's hierarchical. And you start up here with -- I don't know if you can read all this,
but, for example, dense linear algebra is one of the computational patterns here. And
depending upon what architecture I'm parallelizing this on, this may be decomposed into
different patterns down here. So dense linear algebra. We're doing some matrix
multiplication, we're using this pattern.
If we're doing it -- if we're going to do a block decomposition, we can use the geometric
decomposition pattern, later on maybe distributed array pattern and some SIMD. So
these patterns will be decomposed hierarchically.
And the importance of this is to help communicate best practices to new parallel programmers, or just programmers in general, who are starting out with this sort of high-performance computing jargon here -- because parallel computing has been going on for a long time in the HPC community; it's only more recently that application developers have been starting out with it. It gives us a common terminology and helps guide the design process.
>>: [inaudible] next week actually we're having a joint workshop with [inaudible]
Illinois and Berkeley to discuss this pattern language right after the Par Lab retreat. So if
any of you guys are staying -- are coming for the Par Lab retreat, you can stay and visit.
>> Eric Battenberg: All right. So, for example, just this leg of the iterations updating the
H, which is the time contributions, we can decompose this into this little block diagram
of matrix multiplies, element divides, and then column sums. And then when we
decompose this into the pattern language, we start with this high-level compositional
structure, just pipe-and-filter, basic consumer/producer architecture. And within that we
have little blocks that are made up of matrix multiplies, sums, and element-wise
arithmetic, which is completely data parallel and sort of naive here.
But, for example, in the sums, we apply the MapReduce pattern, which is just an example
of a way to do a reduction. Sum is a particular example of a reduction. And then in CUDA, which we are targeting with this particular implementation, we use the graph algorithms pattern: we're doing the reduction using a binary tree, traversing that binary tree in a particular way. And each of these patterns corresponds to an actual document written about the best practices, with pointers to helpful resources for programming these things. That can be decomposed all the way down to SIMD and collective synchronization on the CUDA architecture.
So hopefully you have a better idea of like what goes into this, all the computation here.
We have the matrix multiplies, sums and element-wise arithmetic.
Okay. So for the OpenMP implementation, this was really easy. I did this in like 30 minutes. I took my sequential code, wrote some reduction routines, wrote some data-parallel for loops. Who's written OpenMP here? Raise your hand. Everyone. No. Not everyone. A lot of people. Who's read about OpenMP? Like who's seen this example for OpenMP? Okay.
So all you need to do is add one line. And if your work is naively -- embarrassingly -- parallel, it will just divide the work up amongst cores. And in combination with that, there's the reduction version: you just define the reduction operator, addition, and the final variable that you're going to be summing all the partial sums into, and each of the threads can compute a partial sum.
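For reference, a minimal sketch of the two constructs being described -- the parallel for and the reduction clause (hypothetical array names, not the actual source):

```c
/* A minimal sketch of the OpenMP constructs described above.
   Array and function names are hypothetical. */
#include <omp.h>

void elementwise_divide(const float *num, const float *den, float *out, int n)
{
    /* One pragma splits this embarrassingly parallel loop across cores. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = num[i] / den[i];
}

float column_sum(const float *x, int n)
{
    float total = 0.0f;
    /* The reduction clause gives each thread a private partial sum and
       adds them into 'total' at the end. */
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < n; i++)
        total += x[i];
    return total;
}
```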
So that's the OpenMP I did for this. And, yeah, it's really impressive, right? So we use Intel's Math Kernel Library for all the matrix multiplies, which uses OpenMP under the hood, so you can control the number of threads there. And we show the performance scaling on a dual-socket Nehalem machine. It goes from 11 and a quarter seconds down to 2.6 seconds at 14 threads. And we get about a 2 times speedup going from one to two threads.
But as we get to 14, we get to the minimum; it's definitely nonlinear scaling. So this probably won't scale well on the same architecture using the same programming model. Though we did get a 7 times speedup compared to the MATLAB implementation -- down to 2.6 seconds from around 20, that's getting pretty good for 20 seconds of audio.
All right. So CUDA, I'm not going to go -- who's actually programmed CUDA? Okay.
That's good. I'm not going to like get into this. You've probably seen this before. But
the idea is you're running lots of threads in CUDA. And threads can be grouped into
thread blocks. And within thread blocks you can share memory, and that's how
communication is done.
Physically, these threads are executed in groups of 32 called warps. And if all the threads within a warp do the same thing, then we get SIMD execution. If threads within the same warp are not doing the same thing, that's called divergence, and it means those threads are going to be executed sequentially, and we lose the SIMD speedup.
So down here we see an example of a CUDA kernel which is run on the GPU, and this is
just accomplishing element-wise addition. And each thread here just does one addition.
So you launch the number of threads, which is the size of your array, and you call this
kernel down here using B blocks -- B thread blocks of size N each. That's how you tell
the device how many threads to use. And each of these threads operates on a particular
element, computed from its thread ID and its thread block ID and everything.
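A minimal sketch of the kind of kernel on that slide -- hypothetical names, with a bounds check added in case the array size isn't a multiple of the block size:

```cuda
// Element-wise addition: each thread handles one array element.
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    // Global index computed from the block ID and the thread ID.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host-side launch with B thread blocks of N threads each (B * N >= n):
//   vec_add<<<B, N>>>(d_a, d_b, d_c, n);
```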
So luckily I didn't have to program a matrix multiply in CUDA. That was taken care of for me by one of Jim's students [inaudible]. He came up with this really fast implementation that achieves 60 percent of peak on the GTX 280, about 370 gigaflops. I also found that padding my matrices to multiples of 32 worked pretty well; it decreased the running time by about 26 percent.
Also the element-wise arithmetic is pretty naive; it was similar to the example on the previous page. The hardest thing to program, though, because it wasn't provided in any libraries, was the reduction -- but I did find some good pointers in the CUDA SDK about how to optimize parallel reduction.
So parallel reduction is best accomplished in CUDA using a binary tree traversal. Each thread copies an element from global memory to local shared memory, and then we do a series of two-element reductions until we arrive at the final sum.
Now, there's different ways of organizing this, different ways of assigning threads and
how they operate on different elements. And the optimal way of doing that is to assign
threads so they're adjacent to one another and assign them to have a strided memory
access pattern. And you can read about that in -- if you're interested, in the CUDA
programming manual.
But the important thing here is that all the active sums are adjacent, and all the threads that aren't working are adjacent as well. So you're avoiding divergent warps; if every other thread is not working, that's not SIMD.
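A sketch of the sequential-addressing tree reduction from the CUDA SDK samples that this is describing (it assumes the block size is a power of two; names are hypothetical):

```cuda
// Each block reduces blockDim.x elements to one partial sum.
__global__ void block_sum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Step 1: each thread copies one element from global to shared memory.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Step 2: halve the number of active threads each pass; the active
    // threads stay contiguous, so no warp mixes working and idle threads.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Step 3: thread 0 writes this block's partial sum.
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
// Launched as block_sum<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(...)
```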
All right. And then we apply these optimizations, like loop unrolling. So moving down this chart, we're increasing the number of optimizations: reorganizing the tree gets us to here, and then we apply some loop unrolling.
And, last, running 30 sums concurrently. There are 30 different sums of length 512 in this Non-negative Matrix Factorization. Those aren't very long sums; individually they aren't going to give us much work to parallelize. So if we run them all concurrently, we get all the way down here to about 3 milliseconds from 155 milliseconds. So that was the most important optimization. Yeah.
And that's a lot of work just for a sum. But it paid off. And you can see the results we get: our original optimized MATLAB implementation goes from 18.5 seconds all the way down to .6 seconds, which is a little over a 30x speedup. We're doing 20 seconds of audio in .6 seconds, which is pretty good.
And the OpenMP implementation is still a pretty good speedup: 2.6 seconds for 20 seconds of audio. And you can see that the SGEMM, the matrix multiplies, is the main contributor -- the main thing that achieves speedup here. I don't know why MATLAB takes so long to do element divides. That's kind of silly.
So yeah. The important thing here: CUDA achieves a lot higher performance; OpenMP, though, is a lot less work. So for people who program music applications but aren't expert programmers, CUDA is going to be a bit of a turnoff. When you introduce inter-thread communication through shared memory, CUDA gets a heck of a lot harder.
So CUDA would only be feasible for computational kernels that require a lot of performance. And also for audio there's a question -- we haven't done this yet -- of processing audio in real time, going back and forth to the GPU: are we going to be able to achieve the latency required? And we're going to be releasing Python modules for the music information retrieval community.
So to finish up, this fun brainstorm idea, combine music information retrieval analysis
with gestural processing or just interaction with computers to come up with like a custom
music mix or a custom music performance. So this is applicable to automatic mashups,
which is the buzz word now for taking things that you didn't write and putting them
together and saying you did.
[laughter]
>> Eric Battenberg: Gestural music selection, so you can imagine like using that cool
touchscreen thing at the spitfire last night, you have that on a wall and you say I don't like
this song, and based on all your -- so based on all your preference -- personal preference
data, an audio database in the cloud of music and some music information retrieval,
maybe we could navigate a personal space of songs we like.
So let's say we're at a party and the song is just -- is too loud or it's too -- it's getting
everyone too rowdy and you want to tone it down a bit, you can kind of traverse -- you
could traverse this space, come up with a point where it fits the mood correctly and just
stop there. So maybe seamlessly moving between different songs until you get to a point
where you think it fits your mood the best and then just leave it there and let the computer
decide what fits that mood best.
And you can combine all this music information retrieval with gestural processing, cool
interfaces, and just come up with a -- you know, a custom audio mix. And this can be
used for music performance, if you want to accompany yourself with different
instruments pulled apart from different pieces of music and resynthesize later as sort of a
mashup, or just for music listening, interactive music listening, where you're controlling
your own musical experience rather than just clicking on individual songs. It can be very
much personalized, and you can have as little or as much interaction as you want.
All right. So to wrap up, there are lots of music applications, both for musicians and for music fans, listeners. And parallel computing enables new applications. That
previous example here will definitely be enabled by parallel computing, you know,
categorizing tons of audio data, processing these complicated Multi-Touch sensors; that
will all be enabled in the future due to parallel computing.
But synchronization is really important as well. Because timing is so important to the
human ear, we're going to need to put some emphasis on that for realtime music.
And parallel design patterns: they're a cool way to organize and communicate your code. And I encourage you to at least look into that as a way to communicate the computational needs that you have.
And, last, are there any questions? Yeah.
>>: You talked about some of the experimentation with processing being done in
MATLAB. Have you ever tried their parallel toolbox?
>> Eric Battenberg: I have not. The kernels that I wrote -- the computational kernels
that are being called in MATLAB here, matrix multiply, FFT, those kinds of things, are
already parallelized in MATLAB, I guess. So when I quoted this number, the number
back there, that was with two threads on a laptop, right? MATLAB automatically uses -- I don't know what it uses under the hood, but it's calling BLAS routines that are already parallelized. I haven't used MATLAB's parallel processing toolbox. I don't --
>>: [inaudible]
>> Eric Battenberg: Yeah. Okay. [inaudible] BLAS.
>>: [inaudible] MATLAB's parallel toolbox and then found it rather frustrating because
the way that it works is it spawns a new process. You can't actually spawn a thread. It
spawns a new MATLAB process and you're limited to only using four of them. So I
have a Nehalem that has eight threads but I can't use them because MATLAB's licenses
are [inaudible].
>> Eric Battenberg: [inaudible] license for every core.
>>: [inaudible] about enabling domain-specific languages or languages that are more
natural to the domain. So I was curious if there's any help at that level of abstraction. It
doesn't sound like it, does it.
>> Eric Battenberg: No.
>>: There's some baby steps, but I don't think it's quite there yet.
>> Eric Battenberg: No, not in MATLAB anyway. MATLAB is -- yeah, licensing in MATLAB is too proprietary. I think a lot of people in the music information retrieval community are moving towards Python actually, because it's open source, it's free. And people are writing cool modules. I'm going to be releasing Python modules that you can call from within Python to run CUDA code or OpenMP versions of this.
So yeah. Any more questions? Yeah.
>>: Max/MSP is a domain-specific language?
>> Eric Battenberg: Yeah. Max/MSP is a visual dataflow language that we're working
on parallelizing right now. And a lot of musicians use it. I don't know if it's the best for
music information retrieval, but for live audio performance, definitely really good. Yeah.
>>: So when you're doing the Non-negative Matrix Factorization, do you get away with
doing that once per song or do you have to do it about every second [inaudible]?
>> Eric Battenberg: You do it every 20 seconds, or it works best when you're working
on separate like chunks of audio. So 10 or 20 seconds of audio. So for an entire song,
you have to do it a few different times.
>>: You can't just do it once in the beginning and use the same type of performance for
the rest of the song?
>> Eric Battenberg: You could, but the reason you want to split it up is because
instruments change throughout a song. If the song was completely homogeneous -- the
same drums, the same guitar the entire song -- it would be easy to just do it once. But
that would be a pretty boring song.
>>: All right. Well, let's thank our speaker.
[applause]
>> Nelson Morgan: Okay. That's a tough act to follow, actually. It's tough to follow
music with boring old speech. But hopefully I won't make it too boring.
You see the title. And I have an admission to make, which is I don't actually care about
parallelizing speech recognition. I care about making it better.
But it turns out that there are at least some possibilities, I think, for making it much better
by using a lot more computation. And since the way the world seems to be that to get
more -- lots more computation in the future, we need to parallelize, I'm interested.
I'm Morgan. Adam Janin is here, and he'll be giving the second part of the talk. Jike
Chong couldn't make it. He did a lot of the work that Adam will describe later on in the
talk in addition to Adam's own work.
And of course we're from Berkeley. We're also from the International Computer Science
Institute, closely affiliated institute a couple blocks off campus.
So my fantasy is that we can make speech recognition much better by using a lot more
computation and having many streams on many cores. And I'll explain what that means.
I have some results from that. And also part of what we use to get this is some neural
network components. And Adam will be talking about that along with the work that Jike
did on parallelizing a complete unistream system. And unistream is just a one stream
system. And, again, shortly I'll explain what I mean. Oh, in fact, right there I'll explain
what I mean.
This is just kind of a generic picture of the components of a speech recognizer.
Everybody uses something like this. Of course there's lots more details inside of each
box. But basically from the speech people compute some kind of signal processing
variables. Those variables are passed to something that estimates the probability of different speech sounds being responsible for these variables that you have locally; that is, say, a hundred times a second: little 20- or 30-millisecond chunks stepped along every 10 milliseconds.
These are going to give you probabilities, but the way that they are able to give you
probabilities is that there are acoustic parameters which are -- which come from models
that have been trained on a lot of data offline. And these local probabilities are then fed
to another box along with prior information that you have about the nature of the
language such as how probable it is that a particular word is going to follow two or three
or more words. And the composition of a word in terms of those speech sounds that
you're getting the probabilities for. And at the end you get a stream of words.
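In equation form, that last box is doing the standard decoding criterion:
\[
\hat{W} = \arg\max_{W} \, P(X \mid W)\, P(W),
\]
where P(X|W) comes from the acoustic models (those local speech-sound probabilities) and P(W) from the language model (how probable a word is given the two or three words before it).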
So that's your basic system. And if you've got lots of data to get all of the training
information and your testing situation is roughly the same as your training situation, this
actually works reasonably well.
So that's sort of what I say in this top bullet here. Speech recognition works pretty well
under good conditions with plentiful resources. So if you have lots and lots of training
that corresponds to the way you're going to test it and there's not a lot of noise or not a lot
of reverberation, you can do pretty well. Depending on the task, you can do down to a
few percent.
We recently did a Mandarin broadcast news task and once we got enough data, over a
thousand hours of data, we got the broadcast news down to a couple percent. On the
other hand, Mandarin broadcast conversation, even without any particular noise and
reverberation, it was more like 10 percent.
So you do get much poorer performance for some conditions. And particularly if you get some combination of noise and reverberation and a relaxed, conversational kind of speaking style, you can easily get over 30 percent of the words wrong.
Yes, sir.
>>: [inaudible] percent acceptable?
>> Nelson Morgan: Well, it depends on the application. But there's -- human beings
generally don't do particularly better than that. But if you actually look at results of, for
instance, just recognizing numbers by human beings in a call center, after lunch it's
terrible. After coffee, it's pretty good.
>>: Even 30 percent is acceptable for some applications. For example, information
retrieval, 30 percent doesn't have a whole lot of effect because most of the speech
recognition errors are of versus the, in versus on, things that are on the stop list for
information retrieval anyway.
>> Nelson Morgan: Right. So 2 percent could be not good enough if you were doing
something stupid like trying to control a scalpel.
>>: Firing missiles.
>> Nelson Morgan: Firing missiles. Right. Yeah. Then 2 percent is not good enough.
But, on the other hand, as Adam's saying, if you are doing voice search, maybe 30
percent would be good enough.
But there's certainly a range of use cases where you'd like it to be much better than 30 percent. And you do have those kinds of numbers for a lot of datasets.
So what people have already done is made use of several different signal processing
methods, not just single one. In fact, a number of the best systems have two or three
different kinds of ways of looking at the speech data, and they incorporate them in
different ways, and they do pretty well.
But in general people have found once you have two or three of these, if you go to a
fourth or fifth or a sixth, you get diminishing returns. So I've been pursuing sort of the
counterview, which is maybe there's a way that you could use a huge number of them and
actually get better robustness. I'll show here, this is basically the same picture, except we
have different signal processing modules.
And the reason why I think this could help is that it could be that -- say, if you had
thousands of these, maybe you only need to do three or four, but in a different situation,
acoustically and in terms of speaking style, et cetera, it might be a different three or four.
So where does this fit into parallelism? Well, the front end that I'm imagining is embarrassingly parallel. Now, as the front end is now -- that is, what I'm calling the front end is the signal processing chunk -- if you said you were going to parallelize that, anybody who knows this field would sort of say, yeah, why bother, because maybe it's 10 percent of the computation. But I'm talking about expanding it out and making it lots more computation, in which case you really would need to parallelize it in the future in order to get the throughput.
And then it would maybe be 90 percent of the computation instead of 10 percent.
And I just added this yesterday. I realized that there's a little bit of analogy with the
multiview Photosynth stuff that was discussed yesterday morning, that you have many,
many different views of the same scene. The difference is that, at least as it was
described yesterday, this was in order to give you the possibility of viewing it many
different ways, constructing the model, and so forth. And here the major reason for it is
that some of these views will be obscured and you want to take advantage of the ones that
aren't.
So here's a picture of how we would expand out the front end. Right now what people do
is compute something like an auditory spectrum. And what you mean by auditory varies.
But essentially it's a time-frequency plane with a few auditory properties. One that
everybody uses is that there's less resolution at high frequencies than at low frequencies.
But you might have some other auditory properties in there.
And then you split up the data. So there's a factorization discussed in the last talk. That's
one way you could split up the data. The other is just sort of all over the place. Do
different Gabor filters, much as is done in Vision, with different kinds of scales and range
and so forth.
And then the output of these Gabor filters goes into multilayer perceptrons, and these are trained discriminatively. So they accomplish two things: first, since they are trained discriminatively between different speech sounds, they give you features which are somewhat discriminative; they're pretty good at discriminating between different speech sounds.
And the second thing is they output posterior probabilities of these different speech sounds. And the nice thing about posterior probabilities is that there are all sorts of cool ways to combine them.
So then the last box -- I'll come back to the second to the last box, but the last box over
there is just some transformations that we do to make them nice features for Gaussian
mixtures to model. Basically make them roughly orthogonal and roughly Gaussian.
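A common concrete form for that last box, in tandem-style systems like this one (the exact transform used here may differ): take the log of the posterior vector and decorrelate it with a KLT/PCA,
\[
z_t = T^{\top} \log p_t ,
\]
which makes the features roughly orthogonal and roughly Gaussian for the Gaussian mixtures downstream.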
So the big questions -- we sort of knew how to do all this. The big questions come in
here. Because, again, I think if you just combine in a naive way a thousand different
streams or a hundred different streams, it's probably not going to work too well. And so
you want to do something that perhaps selects or at the very least strongly de-emphasizes
a lot of these streams.
But the main thing -- the main point I want to get at here is that when I -- if I was trying
to do recognition on this speech, I wouldn't want to have to have come in here before and
trained up on a thousand hours of other people speaking in this room. You don't have to
do that. If you've never been in this room before, you came in here, and you're still able
to understand me. Hopefully.
But that's where we are in speech recognition now, is that we have to collect huge
amounts of data to go for a new genre, a new language, a new -- just a new room. And
that's a bad situation. So the idea is to have lots of different ways of looking at the
speech, and then to be able to switch between them or combine them in some good way.
So I think it would take a longer talk to explain why I think this would work, but the
practical engineering thing is that we've tried some of these ideas out and at least on a
small scale they do work.
This is a set of results for numbers recognition. Numbers is not just digits; numbers includes things like 16 and 60 -- 6, 16, 60 -- which are all pretty similar. Our best results are sort of under 3 percent with a standard system. If we add some of these streams, we get a couple percent, which is getting rid of about a third of the errors. If we add lots more streams -- we clearly don't know how to do that quite right yet, at least for clean data -- it does get a little bit worse.
But in the noisy case, we made this much tougher. We added all kinds of noises with signal-to-noise ratios varying from zero to 20 dB. And it's not white noise, it's many different natural noises: speech babble, inside of an F-16, inside of cars, all sorts of things. And you end up worsening by quite a bit, by a factor of five. And we get rid of about half of the errors. So that works pretty well.
And, by the way, we want to point out this is a postdoc of mine, Sherry Zhao, who did
this work, and she's currently extending from 28 streams to a hundred streams.
So at least for the noisy case, which is kind of what we're trying to get at -- since for many applications all of these numbers would have been okay for clean, but probably not so much for the noisy case --
>>: So the back end for all of those is the same, SRI [inaudible].
>> Nelson Morgan: Yeah. This is SRI's -- yeah, we're using SRI's Decipher, which is pretty advanced.
>>: So you don't have to change anything in the back end [inaudible].
>> Nelson Morgan: We haven't changed anything, no.
Then we went to a larger task. In this case it's still not noisy. Some of it's conversational data, which is called broadcast conversation, so it's not that simple. The MFCC baseline is about 25 percent, and it's about 22 percent if we add a few streams. So we haven't gone up to more streams yet. We haven't added noise yet.
This is one of my students, Suman Ravuri, who's doing this. Again, this is a clean case, so the improvement isn't as dramatic -- it's about 16 percent. You generally see a reduction in how much cool techniques help when you go from a small task to a large task. But it's still there. And I think if we were looking at a noisy case, it would be much bigger.
Okay. So how does this relate to parallelization? We're going to expand the easily parallelizable but computationally dominant parts -- at least in this many-stream picture, the feature streams' multilayer perceptrons, which take a lot of computation, because these are, by the way, big multilayer perceptrons. They can be millions of parameters each.
I should also mention that although I'm talking here about recognition, one of our big concerns for parallelization is training. Because right now it takes us a couple of weeks to train one of these MLPs, and we'd like that to be an overnight job.
And the Gaussians for the statistical engine -- those are all pretty regular and pretty straightforward to parallelize, but very important.
Then there's parallelizing a full unistream system, and this is Jike Chong's work that
Adam will talk about in a moment.
But ultimately what we want to do is work with a full application, which we haven't done
for Par Lab yet. And of course the reason why we want to do that, take, you know, full
application, is because we want to watch out for [inaudible]. There may be other things
we're just saying, oh, that's not very important, and they could kill us.
So we want to ultimately end up with a demoable version of a complete meeting diarizer
or browser. This is a picture of one that was done as part of a European Union project
where you have time moving along here and you have different speakers, so you're
separating out by who's speaking when, and you have a transcription there and then
there's other notes that can go off to the right such as summarizations. When the
Europeans did their collections, they also did video, and so there's other -- video can be
attached to it.
So these things already exist at a research level. But, again, the performance is at this
point still not all that great. We'd like to improve that and also be able to handle more ad
hoc meetings that might occur in noisy or reverberant environments.
Okay. So, again, just to give credit, Sherry Zhao and Suman Ravuri are working on this
trying to make the things better. At ICSI Adam and Chris Oei who works with him work
on making it faster. And Jike Chong and Youngmin Yi, who are Kurt Keutzer's students,
are working on this full unistream system.
And that's my part. Oh, yeah, sure.
>>: In the work you described, there's one back-end recognizer that's running on
multiple streams.
>> Nelson Morgan: Yes.
>>: Is there a reason for not running multiple back-end recognizers each on one or a
subset of the streams?
>> Nelson Morgan: Well, I think the notion is to do the combination as early as possible
so that you have as good of information to pass on later as possible.
I mean, in a real system, if you were facing -- something I know you love, DARPA kind
of deadlines, you might want to partly do some of each. For instance, in SRI's system,
they take the two- or three-stream system that we have already and they take another
system that's unistream and they cross-adapt so there's -- there's combination here and
there's combination there; then they take their system and combine it at a ROVER level with
[inaudible] system. And so in a system where you don't care about how much junk you
throw in there, you'll probably do all sorts of things.
But I think the thing that we're pursuing, not that another method wouldn't work, is a
notion that's more like what we think is going on in -- I didn't talk about this -- in
[inaudible] auditory cortex, primary auditory cortex, where there appear to be many,
many different representations. So the brain does something with that. I don't know
what goes on after the primary auditory cortex.
>> Adam Janin: Just to add to that, though, the other reason is that for Par Lab we're more interested in online and real time, whereas for the evaluation systems it's all batch decoding. And with a batch you can wait until you're at the end of a news show and you have an entire utterance -- an entire hypothesis -- that you then combine with another system's entire hypothesis. It's not clear how you combine at the decoder level if you're trying to do it online, in real time.
>>: [inaudible] so when you expand from one stream to 100 streams, do you expand the auditory spectrum or do you expand the number of Gabor filters there?
>> Nelson Morgan: We expand the selection of Gabor filters that go into a particular
stream. So one thing that's probably a little wrong about my picture is that each of the
streams actually consist of some collection of Gabor filter outputs. And so you can easily
expand out to a thousand given the combinations of how wide you make them and how
much of the spectrum corresponds to what Gabor filter and so on. But right now 28 to a
hundred is all we're doing.
But the auditory spectrum, it's just the base that we work from. And then you take
chunks of it and run Gabor filters. Typically -- initially we run the same Gabor filter on
all parts of the spectrum, but you could change that.
>>: I thought in auditory research people typically think about having different aspect of
[inaudible] like a rate synchrony, all these -- there's a lot of [inaudible].
>> Nelson Morgan: So I sometimes think of auditory in this context as meaning whatever people haven't implemented yet. Right? In real systems. Right? Because PLP is auditory, but everybody has PLP, so they don't think of it as auditory anymore. So I meant this in just the broadest possible sense, that there's some kind of time-frequency plane that has some aspects of auditory ideas in it.
[inaudible] in his stuff has tended to put in more than we're doing right now. Right now we're just taking a log mel spectrum, more or less.
>>: [inaudible]
>> Nelson Morgan: But I'm open to the idea that maybe putting more auditory aspects in
there might make things better. I have no idea. Okay. Adam.
>> Adam Janin: So I'm Adam Janin and I'm going to talk about some other parts of the
system. I'm going to start with the multilayer perceptron, or neural network.
So the neural network is a machine learning algorithm that just maps some floating point
inputs to some floating point outputs. And in our case we're mapping speech features of
some sort -- you know, for those of you who are familiar with them, [inaudible] spectral features or PLP features and so on. And we're trying to map that to speech sounds: is it a "ku" or is it a "tu"? We do that every 10 milliseconds.
Now, in a picture like this, with one neural net, even though our neural nets are quite
large, you know, 16,000, 32,000 hidden units, as Morgan said, it doesn't take up that
much compute time. So the reason we're interested in parallelizing it is first to handle the
sorts of many, many streams that Morgan talked about in his part, but then also during
training we train on thousands and thousands of hours, and it can take a month to train a
big network. And we'd like to cut that down so that we can run many more experiments
and try different types of neural networks and different types of features.
So what I'm showing here to just describe what the computation looks like is we have an
input feature which might be, I don't know, 39 features, each one is connected by a
weight which is just a floating point number to a hidden layer. And so the operation, the
mathematical operation here is a matrix times a vector. And then you perform a
nonlinear operation at the hidden layer and then you repeat to the output layer.
In a real system it's a little more complex than that, because we batch up all these 39 features into one frame, and then we include not just the current time but the four future times and the four previous times. And so in this picture here, each one of these is the 39 features and we have 9 frames of them. But it's still a matrix times one long vector; it's just that the vector is much longer now, because it's looking at a longer time frame.
And we do that because the way you say a speech sound depends on the context it's in. How you say it depends not just on the sound you're trying to say, but on the previous sound and the next sound.
In addition to that, for efficiency reasons we also take up to 10 seconds of these and batch them all together into what's called a bunch. And that allows you to do a matrix-matrix operation rather than a matrix-vector operation. And that's of course much more efficient.
At runtime the cost of doing the grouping -- the bunching, rather, is latency. You have to
wait until you've gathered them all up. However, that's not really too much of a big deal
because you typically have some latency introduced waiting for a word to finish anyway.
And so introducing a few seconds of latency at this point isn't a real problem.
The bunch sizes change the training in that you're now optimizing for something slightly
different than you would if you did it one at a time. But in practice you get the same
results either way. It's not very sensitive to that. And the matrix-matrix operation is
much more efficient.
And so in terms of parallelization, we already have an existing optimized multicore
implementation called QuickNet that was developed at ICSI years ago and has been
parallelized sort of over time for existing cluster computers. And we used that quite a lot
at ICSI, and other places as well have used it.
We've also developed a CUDA implementation. And it gets about three times the speed
that QuickNet currently gets. I don't have a uniprocessor number on what the speedup is
over a single thread, because we just never run it that way anymore. But I certainly could
get that if people were interested in what it is.
Now, once we optimized the matrix-matrix multiply, the computational [inaudible]
actually ended up being a combination of assembling the data for the context windows
and the nonlinearity. And so we addressed that by doing a fast approximate exponential
on the 8-core version, and then fortunately CUDA has a built-in parallelized exponential,
and so we use that on CUDA.
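The talk doesn't name the specific approximation used on the 8-core version; one well-known option is Schraudolph's bit-level approximation of exp, sketched here in numpy. The constants come from that published trick, not from QuickNet, and it's only accurate for moderate inputs.

```python
import numpy as np

EXP_A = 1048576 / np.log(2)    # 2^20 / ln 2
EXP_C = 1072693248 - 60801     # IEEE-754 exponent bias term minus a correction constant

def fast_exp(x):
    """Approximate exp(x): write a scaled copy of x into the high 32 bits of a
    float64 and reinterpret the bits (Schraudolph's trick). Valid for moderate |x|."""
    x = np.asarray(x, dtype=np.float64)
    hi = (EXP_A * x + EXP_C).astype(np.int64) << 32   # high 32 bits; low 32 bits zero
    return hi.view(np.float64)

def fast_sigmoid(x):
    return 1.0 / (1.0 + fast_exp(-x))
```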
And so with those optimizations, we're now compute bound for any large net, which is
what we typically run. So, you know, as long as you're running a net with thousands of
hidden units, we're completely computation bound on these platforms.
The next topic is something we're working on currently called a tonotopic network. And
it's called tonotopic because we organize the features roughly by tone. So these features
down at the bottom are low frequency -- how much energy there is in the low-frequency
part of the spectrum -- and the higher-up ones are how much energy is in the higher parts
of the spectrum. And we organize the neural network in a structured way so that we're
computing directly on energy in frequency bands.
So the reason we think this might work is, for example, imagine you have band-limited
noise, where at one frequency band there's something going on that corrupts it. The
network could learn that and not let it interfere with the clean part of the
spectrum.
Additionally, we've also found that combining things like this with other systems pretty
much always helps. So just having something that's a little different from the other
systems that you have gives you better accuracy. And as you can see, the structure of
this is going to be a little different in terms of the computations. First of all, we've added
an additional layer, but that's just a dense matrix multiply again.
In terms of the input layer, then, there's the structure in the input layer. One way of
implementing this would be as a block diagonal matrix, where the entries with no
connections are just zeros. But that's not very efficient. So in fact what we end up doing
is separate matrix multiplies for each grouping. And that's just a more efficient way of
organizing it. And then also there's the context window issue that I described before,
which makes the organization of the data here a little more complex. You have to move
things around a lot more. And that just takes a little more time.
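A sketch of that per-band organization: one small matrix multiply per frequency-band group instead of one big block-diagonal multiply full of explicit zeros. The function and variable names, and the sigmoid choice, are illustrative assumptions.

```python
import numpy as np

def tonotopic_layer(band_inputs, band_weights, band_biases):
    """band_inputs[i]:  context-windowed features for frequency band i
       band_weights[i]: (band_hidden_dim, band_input_dim) weights for that band only
    Each band gets its own small matrix multiply; the results are concatenated
    and fed to the next, dense layer."""
    outs = [1.0 / (1.0 + np.exp(-(W @ x + b)))
            for x, W, b in zip(band_inputs, band_weights, band_biases)]
    return np.concatenate(outs)
```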
So we already have an existing single-core version that we used at ICSI, which got
written by graduate students who graduated last year. And the multi- and many-core
implementations are currently in progress. We expect similar speedup. It's probably not
going to be quite as good, because you have slightly smaller matrices and you're doing
several of them, but it should be pretty close.
So next I want to go on to talking about the work of Jike Chong and other students at
Par Lab on a parallelized decoder. So this is the component that, once all this signal
processing stuff has been done and you have some features, theoretically does a search
over all possible utterances in your language and outputs the one that matches your data
the best. Of course, it can't do every possible utterance; that's too many. So it does a
huge, pruned search.
And in typical single-threaded, single-stream ASR systems, this search is the
computational bottleneck. And so it's useful to parallelize it so that you can get realtime
response and just a faster system.
The other components of this, the pronunciation lexicon, the acoustic parameters and the
language model were generously donated by SRI for this project. They've been
simplified a little bit from what we would run as a full system, both to fit in memory,
because we have some memory constraints on the GPUs, and also because again we're
concentrating more on online and realtime systems.
So the full-blown SRI system does many, many, many passes over the data and sort of
assumes you have the entire utterance available before you do any processing at all and it
does adaptation and all sorts of other things.
And so for this project, we're using simplified systems with fewer parameters and so on.
So that's just a little different.
And then the work involved exploring a design space of how to design the decoder for
different architectures.
The inference engine itself is a highly hierarchical structure. The outer loop of it is
organized time-synchronously. This isn't the only way to do it; other decoders do it
differently. But basically we're moving forward 10 milliseconds at a time. And so you're
looking basically at time-synchronous guesses at what the stream might be and updating
them all at the same time. And that's what the outer loop looks like.
The next loop consists of three phases. The first phase is basically the computation
of the acoustic models. It's a Gaussian mixture computation that you're doing for all your
current hypotheses. And this is completely compute bound and embarrassingly parallel.
Well, it's not quite embarrassing; it's a little more complex than that. But it's reasonably
easy to parallelize compared to phases two and three, which involve keeping track of your
updates. And, again, because you can't search every possible hypothesis, you have to just
keep track of the best ones. And to do that the hypotheses have to communicate with
each other to keep the updates in order. And I'll talk a little bit about that in another slide.
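To make the three phases concrete, here is a highly simplified Python sketch of one time-synchronous step. The data structures and method names (wfst.arcs, hyp.extend, acoustic_model.log_likelihood) are mine, not Jike Chong's implementation, and a real decoder also handles epsilon arcs, word output, and backtraces that this ignores.

```python
def decode_step(active, frame, wfst, acoustic_model, beam):
    """One 10 ms step over the current hypotheses (illustrative sketch)."""
    # Phase 1 (compute intensive): score the acoustic models needed this frame.
    models = {arc.model for hyp in active for arc in wfst.arcs(hyp.state)}
    loglik = {m: acoustic_model.log_likelihood(m, frame) for m in models}

    # Phases 2-3 (communication intensive): expand every hypothesis along its
    # outgoing arcs, keeping only the best hypothesis per destination state.
    best = {}
    for hyp in active:
        for arc in wfst.arcs(hyp.state):  # includes self-loops (staying in the sound)
            score = hyp.score + arc.weight + loglik[arc.model]
            if arc.next_state not in best or score > best[arc.next_state].score:
                best[arc.next_state] = hyp.extend(arc, score)

    # Beam pruning: you can't keep every hypothesis, so keep the ones within
    # `beam` of the current best.
    top = max(h.score for h in best.values())
    return [h for h in best.values() if h.score > top - beam]
```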
The hardware platforms that we ran this on are a multicore conventional and then also the
many-core NVIDIA platform. And I think people are probably familiar with these
architectures already. But you can see what the sort of performances look like.
And so this slide is sort of the punch line of the best system, and I'm going to go into a
little more detail later. It's a breakdown of the time spent in the compute-intensive
phase versus the communication-intensive phase. So basically the white is phases
two and three and the black is phase one.
And as you can see, the speedup varies as you go from the sequential implementation to
the multicore and the many-core. All the phases are sped up, but the communication
phases are sped up less than the compute-intensive phases.
And so this does indicate that without some additional effort we'll probably start hitting
some diminishing returns on the communication phase, since we're already up to 50
percent in the many-core case.
And then in terms of why the inference engine -- Jike calls it an inference engine, we
typically call it a decoder, same thing -- why it's challenging. The decoder runs on a
graph called a weighted finite state transducer. And the way you should think of this is it
encodes the sorts of words you can say and their patterns and their pronunciations.
So this might be, for example, op-tics, and this might be phys-ics. So it's sharing the suffix.
And this might be optometry. And you're proceeding through that as a hypothesis. You
can think of a particular hypothesis as just being a spot in the graph that encodes the path
it took to get there.
And the update step is just to move one step forward in the graph, either going to the next
speech sound or staying in the speech sound you're in.
And so the graph is not regular, as you can see; it's just guided by what your dictionary
looks like. And then where your hypotheses happen to be on the graph depends on what
happens at runtime: how you say the words, what the acoustic environment is, and so on.
And then of course there's the problem of synchronization between the phases as you
update the hypotheses.
So we investigated a particular design space for the traversal and for the
computation, and this slide describes that. So the first choices are sort of how you bunch
up hypotheses to go to the different cores. You can either organize them so that you're
going over arcs or you can go over nodes. And of course there are a lot more arcs than
there are nodes.
And so what you could expect is that as you get more SIMD resources, more parallel
resources, in all likelihood the arcs implementation is going to be better than the states
implementation. But it does depend on how many independent threads you have.
And then the other design axis is traversal methods. And this is basically how
you communicate and update the current hypotheses when you're moving from one state
to the next. And you can either do that by propagation, which basically depends on some
hardware locking, or you can do it by aggregation into the final states, which in the
current implementation is done with some software locks.
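Here is a rough sequential sketch of the two work organizations just described; the parallelism and the locking are only described in comments, and none of this is the actual Par Lab code -- the types and helper names are hypothetical.

```python
def traverse_by_state(best_per_state, wfst, loglik):
    """State-based: one work item per active source state; each item loops over
    that state's outgoing arcs, so the work per item is uneven."""
    results = {}
    for state, hyp in best_per_state.items():
        for arc in wfst.arcs(state):
            relax(results, arc.next_state, hyp, arc, loglik)
    return results

def traverse_by_arc(active_arcs, loglik):
    """Arc-based: one work item per (hypothesis, arc) pair -- many more, finer-grained
    items, which tends to map better onto wide SIMD hardware."""
    results = {}
    for hyp, arc in active_arcs:
        relax(results, arc.next_state, hyp, arc, loglik)
    return results

def relax(results, dest, hyp, arc, loglik):
    """Propagation-style update: keep the best score reaching each destination state.
    Run in parallel, this read-modify-write is exactly what needs hardware atomics
    (propagation) or a separate per-destination aggregation pass (software locks)."""
    score = hyp.score + arc.weight + loglik[arc.model]
    if dest not in results or score > results[dest].score:
        results[dest] = hyp.extend(arc, score)
```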
And so we actually developed four different algorithms and ran them on the different
platforms to see which one would work best on each platform. And so this slide basically
shows the breakdown in the different phases for the sequential implementation, and then
here for each of the other implementations.
And basically, if you have platform atomics, you do better along this axis, and as you
have more SIMD you do better along that axis.
And so on the multicore platform, these are what -- the results we had. And as you can
see, you definitely want to do the propagation rather than aggregation on the multicore
setup, but there's not a whole lot of difference between the arc based and the state based,
because you don't have a lot of SIMD lanes. And so it's not a huge advantage to do one
over the other. And so state-based propagation won by a little bit over the other
algorithms.
The picture is slightly different on the many-core, where the breakdown looks more like
this. And as you can see here, the arc-based propagation won by a lot. And you're still
better off with propagation because both platforms have good hardware locks, but on the
many-core architecture you have a lot more SIMD resources. And so there's a much
better win on using the arc-based propagation.
And I think that's all I want to say about the decoder.
And then we just have a summary slide. So as we talked about before, speech recognition
works pretty well under some conditions but still can fail quite badly if you have noise, if
you have accents, if you have unexpected conditions of various kinds in human-to-human
communications.
And by massively expanding the front end, we hope to address some of those issues to do
better in noise, to be more robust to unknown situations, to be better with less training
data.
And then the Par Lab approach has helped quite a bit in making the individual
components much faster, enabling us to do a lot more experimentation as
well as to run in real time.
And then the final point is how Par Lab tools can enable more experimentation. This gets
again at the point that right now training one of our big nets can take a month.
And what we'd like to do is be able to train up a hundred nets, be able to do it in a day,
and be able to experiment with many, many more features.
Morgan, did you want to add anything else to the summary slide?
>> Nelson Morgan: Yeah, I guess the other thing for the last bullet is that right now
we're using our own homegrown software, because these things are proceeding in parallel:
the development of the software frameworks and so on is happening in the lab alongside
our work in speech. We're not using any of that stuff right now. And it'd be nice to see,
as that stuff develops further, whether it could be useful for us.
>> Adam Janin: I'm talking about the mainstream sort of other ICSI researchers.
>>: Do you know whether [inaudible] pared down the SRI decoder's components to
enable all the parallel computations done by Jike Chong?
>> Adam Janin: Yeah.
>>: How much does it reduce the performance as you pare down those components?
>> Adam Janin: I could e-mail that to you; I don't have it off the top of my head. It was
I believe around 4 or 5 percent.
>>: [inaudible] harder to parallelize.
>> Adam Janin: Well, again, the issue is that most of the reduction had to do with just
fitting it in memory on the NVIDIA card. We wanted to be able to have a fair comparison
between the NVIDIA card and the multicore system. And so we just had to take, for
example, language models that were much more heavily pruned than we normally would.
So the language model is just smaller than it would be for our full-blown system.
>>: So which parts of the application do you really want to run locally on your handheld
as opposed to the cloud? It seems like the training might be done on the cloud.
>> Adam Janin: Yeah. Absolutely. So almost certainly you do all your training on a
cloud. You might do some adaptation locally. And then in terms of runtime, ICSI has
participated a little bit in some of the distributed speech recognition tasks, the idea being
that you do some of the computation on your phone, basically all the front-end stuff, and
then you ship that all off digitally to a back end, rather than either shipping audio from
your phone, which introduces its own set of artifacts, or trying to do all the recognition
on your phone.
But I think as the processing power gets greater and greater, you'll see more and more of
it migrating all the way down to your phone.
For some tasks you can do that already. I think most of you probably have cell phones
that will do voice dialling already. And that works perfectly fine because it's right
against your mouth, it's in a quiet environment, and the vocabulary is small. But if you
put the same thing on the center of a table and you try to record a meeting, you wouldn't
get half the words right with the current technology.
>>: Let's thank our speaker and take a break.
[applause]