>>: We'll get started with two talks on music and audio processing. So first we have Eric Battenberg from UC Berkeley. >> Eric Battenberg: All right. My name's Eric Battenberg and I'm with the Center for New Music and Audio Technologies and the Par Lab at UC Berkeley. We call this place CNMAT for short. That's that cool symmetric logo up there next to our bridge. And this talk is entitled the "Breadth of Applications for Music." And I'm not going to be focusing on one particular application but a few different things we're working on at CNMAT. So there are hundreds -- literally hundreds of applications for music and also plug-ins. And plug-ins are things you can -- little pieces of code you can use to extend existing music applications. So these are the four main areas we're working on at CNMAT and at the Par Lab for music applications. The first is performance and composition, and I put some cool logos to grab people's attention here. First Guitar Hero, that's one way to interact with the computer musically. Also the Max/MSP visual programming environment is something that musicians have been using a lot. And Pro Tools is the most popular audio editing software, and that's just a way for amateur musicians to edit their audio at home. Second, music information retrieval, and that's a lot like content-based image retrieval. It hasn't seen as much academic attention, but there are companies popping up doing a bit of this, and then there's also this IRCAM institute in France that does a lot of research in that area. Third, hearing augmentation for music. And this is just for hearing-impaired people to enjoy music more. The types of processing that you need to do to enhance how music sounds for the hearing impaired are very different from what you need to do to make speech more intelligible. So we're working on that. And this is a little subjective space that we train with a neural network that enables the hearing aid wearer to kind of navigate around and decide on a fitting that's best for them, a set of parameters that process the audio according to their specific hearing loss. And last is three-dimensional sound. This is large arrays of speakers and microphones for reproducing three-dimensional sound. We're working with Meyer Sound. And over here we have an example of a 120-channel spherical speaker array. It's about this big and it's used to recreate really cool three-dimensional sounds, real-sounding instruments, because you get really interesting radiation patterns in real life rather than just a single speaker pointing in a certain direction. So in this talk, a little bit of background on music applications, and then some insights into music and parallel computing -- the requirements that music applications have for parallel computing in order to make it work. And then I'm going to talk about a case study, parallelizing drum track extraction, which is a particular type of music information retrieval source separation, on OpenMP and CUDA. This is targeting GPU architectures, NVIDIA GPUs, and just multicore processors in general. But before that, I'm going to talk a little bit about how to communicate the computational needs of this project here using parallel design patterns. And this is just a high-level way to kind of understand what the needs are, what things are being done in my application. And last, a fun brainstorm, something cool, the future of music information retrieval. 
And hopefully it will be a fun, motivating example. So for music composition we're working with different musical interfaces, electronic interfaces. These two are developed at CNMAT. This is a touch-sensitive drum with conductive fabric. You can do interesting signal processing on that. We have this 24-pad Multi-Touch Array that is pressure and location sensitive. And David Wessel performs some interesting pieces with this. And last, we're not working directly on this, but we're somewhat involved. This is the Reactable -- yeah, Reactable -- it's a cool table where you place these little tangibles, they call them, little things on here and make really cool music. I suggest you Google that, because it's a really cool example and it's really easy for novices to play around with and make really cool sounds. For audio editing and audio recording, there are tons of amateur musicians nowadays making affordable home studios just using a personal computer, a sound card, and audio editing software like Pro Tools. And they call this a digital audio workstation. And you can get really professional quality sound out of these just using all these audio effects if you have a good microphone. I guess I should have put a microphone on there too. That would be a part of the cost. So the power of these digital audio workstations lies in plug-ins. These are just little things you can buy that incorporate different effects, compressors to get the levels balanced. Different cool effects like Auto-Tune. Is anyone familiar with Auto-Tune? This is not auto-tuning like a kernel, but Auto-Tune that T-Pain uses to make him sound like a good singer when he's not. That corrects your pitch. It's automatic pitch correction. So different plug-ins like that. And these -- they're developed by third parties or they're included with the audio editing software. And if we're going to parallelize these, we need to ask are they thread safe; that is, are we going to get the same behavior out of them regardless of how they're scheduled on the machine, or are we going to get a crash, are we going to get completely different sounds out of these things if they're scheduled differently and there are different amounts of delay. When they're composed, are they going to cause catastrophic performance conflicts; that is, is the performance just going to become unbearably slow when these are competing for resources within the application. And last, will they appropriately share hardware resources with other applications running on the same machine. Because realtime performance is so important for audio applications, particularly live performances, sharing hardware resources is really important. So to answer these questions, the Par Lab has been working on two little projects -- not little projects, two big projects. This is -- the Tessellation OS is how they're envisioning a space/time partitioning amongst applications. And then down here is Lithe, and that's a second level within your application: how are you partitioning resources amongst the different libraries and system calls so that you aren't competing for cache, destroying your cache when you're timesharing a core. And for the OS -- at the OS level, another thing that music is going to need for realtime performance is timing and deadline guarantees. So if you're doing realtime performance and you don't meet a deadline, a timing deadline, you're going to get a click or a pop. 
Also if two things are split onto different cores, they may end up being written to the audio buffer at different times and you'll get a completely different sound out of it. And I'll have an example of that in a second. So the next point I want to make is that music is inherently very parallel. This is just a score of different instruments playing. And you can see all the voices. It's like 20, 30 something voices all going in a very data parallel way. But the important thing here is that synchronization, audio synchronization and timing are very important. If these things aren't occurring right in time, the ear is going to be upset. So timing -- here's an example. It's going to be an audio example. Hopefully you'll be able to hear it. If we have a copy of a piece of audio and we process another copy of it on a separate core and we miss a deadline -- that is, the audio buffer needs to write to the digital-to-analog converter at a certain time -- and the other core doesn't meet that deadline, it's going to be writing a frame late. Right? So if we're just delaying another copy of the audio by 1 millisecond, we get this combing effect. Anyone familiar with digital signal processing will see that this is like a comb filter; we get these notches in frequency. And just with the 1-millisecond delay, I'm going to show you an example of this audio here [audio playing]. Noticeable. Did everyone hear the little difference there? It sounds a lot more hollow, right? And we got this combing effect. It didn't just -- it wasn't just distortion and clicks, it was actually changing how the sound sounds. And for people who are spending tons of time getting the sound just right on their compositions, this is very troublesome to them. It's like the main reason that IRCAM sort of abandoned parallel audio processing back in the day, because they couldn't get all this timing right. Things would just not sound deterministically correct. So one way around this is Open Sound Control. This is a standard developed at CNMAT and IRCAM. And it's a way to kind of communicate audio performance data, sort of like MIDI back in -- well, they use MIDI a little bit. But it operates over Ethernet. And the important thing for us here is that when you're sending audio performance data, you can include these high-resolution time tags. And these time tags can be used to synchronize. If you say when this audio event occurred and it arrives at the audio buffer, then you know whether you need to delay something else -- if it has been delayed, in fact -- so then you get the synchronization there. All right? So next area, which is my area, I'm working on music information retrieval, and we call that MIR. Also at MIT they have a machine listening group. We like to call it music understanding, because we feel that's a little more all-encompassing than just information retrieval. And that's sort of capturing the psychological aspect of a human experience with music. Up here in the corner, in the middle, is an audio waveform. And this is an example of music transcription. You're producing a musical score at the top there from the audio events. Also at the bottom, that's called a piano roll notation; it's like a time-and-frequency or time-note based plot. That's one application of music information retrieval, automatic transcription. We can also do source separation, which is isolating different instruments for analysis. Similarity, playlist creation. 
Playlist creation is I think the thing that most people can wrap their heads around as being useful, just anyone who listens to music can say I'm in the mood for a certain Led Zeppelin song that doesn't rock out too hard, it's pretty easy listening. But I want to listen to some classic rock like that. I'll give it this Led Zeppelin song and find me other songs like this and make a playlist for me, right, instead of just saying, okay, I'm in the mood for this Led Zeppelin song and then I end up listening to the rest of Led Zeppelin's album. And that's usually how most people listen to this stuff. Unless maybe you go to Pandora or something, right? Pandora will do that for you, but at the expense of being connected to their server. And also all that information that they use for comparison is made by humans. So people are listening to the music, checking off different aspects of it. Maybe that's not the category you want to be comparing. So also we can classify by mood, artist. Score following is important for automated accompaniment. So if you want to play a solo instrument and have the computer automatically follow along where you are in the score and play some computer instruments along with you. Also lyric synchronization. That's for automatic karaoke. And last, song segmentation, splitting up a song into verse, chorus, bridge, et cetera, because they may be pretty different parts, and you can analyze them independently then. So the hope with all this technology is that someday you can query for music like this: I like the drummer but I can't stand the singer. Find me something in the same genre with drumming like this but with a singer that sounds more like John Lennon or Rod Stewart or someone you really like. And this isn't that far-fetched. This kind of thing can be possible in the next few years. All right. So as a case study for parallelizing music information retrieval I looked at drum track extraction. And this is a particular example of source separation where you take an audio waveform in and output an audio waveform hopefully containing only the drums or only the percussion. And this is done -- first there's spectral feature extraction. You take a spectrogram, and in this case we're operating on about 20 seconds of audio at a time. Then Non-negative Matrix Factorization, that's NMF here, is used to split up the spectrogram matrix into time and frequency components. And you can select the number of components you want in this. Usually I use about 30 components just in popular music. There are enough things going on at once that 30 components works pretty well. And then you extract features from the components so that you can classify the components as either percussive or not or drums or not; and then you resynthesize the audio corresponding to those components. And the important thing computationally is that 80 percent of the time in a MATLAB implementation is spent on the Non-negative Matrix Factorization. That's the unsupervised learning routine that is iterative and takes a lot of computation. And so in the MATLAB implementation -- this is my optimized MATLAB implementation; some naive implementations take about 90 seconds to run on 20 seconds of audio, which is totally infeasible. We got it down to about 23 seconds, which had people asking how did you optimize MATLAB. It's MATLAB. But avoiding for loops, using single precision when you don't need double precision -- audio is very much single precision, the samples are usually 16 bits. That's how we got it down. 
Also not computing things. Some people do some silly stuff in MATLAB. Anyway, we got it down to about almost realtime, 23 seconds. But we're going to parallelize this and try to get it down a lot using OpenMP for multicore machines and CUDA for GPUs. Yeah. >>: [inaudible] >> Eric Battenberg: The SVM. Very tediously. I ran it on lots of columns of audio, listened to the components. Drums, not drums; drums, not drums. I made a little interface to do it really quickly. Yeah, I hand-did it, about a hundred clips of audio, and then I cross-validated everything with pretty good results. The classification is 2 or 3 percent error rate actually. Yeah. So here's some audio examples. And also you get a prize if you name the artist and song. I'll buy you a beer at next year's apps workshop or something. And if you just know an artist, then the person sitting next to you owes you a high five. Let's go. Number one. [music playing] everyone knows who this is hopefully. Some people probably know what this song is, right? Black Magic Woman. Now, it's important to listen to the audio, to the drums in the original version. Most people just hear Santana playing. If you're a drummer, you might have heard all these drums here. Now, the drums aren't -- I mean, it did a pretty good job of separating out. You don't hear the guitar at all. The drums aren't the most -- they're not completely pleasing sounding because the way the Non-negative Matrix Factorization works is that it has to smash out the frequency components, all the contributions from all the guitar stuff. So that stuff can be compensated for later. We didn't do any fancy signal processing at the end. This is just the raw output of the Non-negative Matrix Factorization. But also this could be used for drum transcription, other kinds of things. Okay. Here's the next song. This will be different. [music playing] does anyone know who this is? No one. An entire room. No? OutKast. Has anyone heard of OutKast? They rap. They're a rap group. It's a different kind of music from Santana. All right. [music playing] now, this is done completely automatically. So I didn't do this by hand or anything. [music playing] raise your hand if you know the name of this song. Raise your hand if you know the artist. Four -- three people. All right. Led Zeppelin. They're a band too. They're some kind of new band. They like just came out with this song. It's called Black Dog. [music playing] you can hear there's a little bit of distortion in the bass drum and a little bit of influence on the phase from the singer's voice and the guitars and stuff, but it does a pretty good job. All right. Here's kind of what comes out of the Non-negative Matrix Factorization. This is an audio spectrogram here. It's kind of flipped upside down so it looks like a matrix. But it's aligned with the drum score here. This is just only drums here. So to get an idea of what comes out of this, we factorize into three components. And the three components -- these are the spectral contributions; whenever something occurs in this row here, we're positively scaling the contribution of this part to the spectrum here. And this is the time contribution of each of the components. And it's aligned with the drum track here. So you can see all the hi-hat hits are lining up with the hi-hats in the score and the bass drum and snare drum, et cetera. >>: I've got a question. >> Eric Battenberg: Yeah. 
>>: When somebody hits the hi-hat, depending on how hard they hit the hi-hat, doesn't that change the timbre and the frequency? >> Eric Battenberg: It does, yes. So this -- the reason Non-negative Matrix Factorization works pretty well for drums is that the spectrum for a drum is fairly stationary. How hard you hit it, it definitely changes the timbre. And this is a very canned example because I just synthesized this using the same drum sound each time. So the reason you use more than the number of instruments you have as the number of components is because sometimes things sound different, right? So if he's going to hit the snare drum really hard, the timbre will change; that will end up being a different component. But as long as it's classified in the end as drums, as a spectrum that sounds like drums or a time component that sounds like drums, it will end up in the drum track at the end. Yeah. >>: So different hits to the snare drum will be different instruments? >> Eric Battenberg: Yes. That's true. Yeah. Hit it harder, hit it on the rim, hit it on the side, different sounds. Yeah. All right. So Non-negative Matrix Factorization is an optimization problem. And the important thing here is that we're minimizing a cost function. And instead of just using a mean squared error for the reconstruction between the product of those two matrices and the original matrix, we're going to use this divergence function. It's similar to the Kullback-Leibler divergence. It's a log-based function. And this has been shown to work well for music. And the important thing here is these are the gradient-based updates that iteratively minimize this. And this cost function here is not convex in both W and H. So we will be arriving at a local minimum. And some people get around this by doing the optimization multiple times and using the best result. But you do come up with a non-global solution. It still works pretty well most of the time. Computationally, for the typical 20 seconds of audio, you're factorizing a 512 by about 3,000 matrix. And we do it into 30 sources. So each iteration can be broken down into about 400 megaflops of single-precision matrix multiplies and 3.6 megaflops of divides, which are pretty slow for those of you who -- yeah? >>: So megaflops is the number of flops, not flops per second? >> Eric Battenberg: Yes. I usually use lower case when I'm -- number of operations and all upper case -- what is the convention? Is there a convention for number of flops per second? Okay. Yeah. And then, also important for parallelism, we have some sums here. It's not much work, but if you're going to parallelize this onto a lot of threads, it requires a lot of communication to do sums in this way. And also we do some other things here. We compute a log-based cost function every 25 iterations, which is pretty slow, because logs are slow. So how can I communicate to you what this app involves, what the important design considerations were in this application? How can I come up with some jargon that will help you understand what went into this and also communicate to other application experts, like Jim Demmel, what I need out of this application? And one thing we're working on -- this is sort of a plug for something we're working on at UC Berkeley -- is this parallel design pattern language. And we call it OPL. It's just a working name, Our Pattern Language. But it's hierarchical. 
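(For reference, the divergence and the multiplicative updates being described here are, in the standard Lee-Seung form -- a sketch of the usual formulation, which may differ in detail from the exact variant used in this work:

D(V \| WH) = \sum_{i,j} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right)

H_{aj} \leftarrow H_{aj} \, \frac{\sum_i W_{ia} \, V_{ij} / (WH)_{ij}}{\sum_k W_{ka}}, \qquad
W_{ia} \leftarrow W_{ia} \, \frac{\sum_j H_{aj} \, V_{ij} / (WH)_{ij}}{\sum_k H_{ak}}

The element-wise divides V_{ij}/(WH)_{ij}, the matrix multiplies against W and H, and the column sums in the denominators are exactly the operations counted above.)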
And you start up here with -- I don't know if you can read all this, but, for example, dense linear algebra is one of the computational patterns here. And depending upon what architecture I'm parallelizing this on, this may be decomposed into different patterns down here. So dense linear algebra. We're doing some matrix multiplication, we're using this pattern. If we're going to do a block decomposition, we can use the geometric decomposition pattern, later on maybe the distributed array pattern and some SIMD. So these patterns will be decomposed hierarchically. And the importance of this is to help communicate to new parallel programmers, or just programmers in general who are starting out with this sort of high-performance computing jargon, the best practices -- because parallel computing has been going on for a long time in the HPC community; only with application developers has it been starting more recently. It gives us a common terminology and helps guide the design process. >>: [inaudible] next week actually we're having a joint workshop with [inaudible] Illinois and Berkeley to discuss this pattern language right after the Par Lab retreat. So if any of you guys are staying -- are coming for the Par Lab retreat, you can stay and visit. >> Eric Battenberg: All right. So, for example, just this leg of the iterations, updating the H, which is the time contributions, we can decompose into this little block diagram of matrix multiplies, element divides, and then column sums. And then when we decompose this into the pattern language, we start with this high-level compositional structure, just pipe-and-filter, a basic consumer/producer architecture. And within that we have little blocks that are made up of matrix multiplies, sums, and element-wise arithmetic, which is completely data parallel and sort of naive here. But, for example, in the sums, we apply the MapReduce pattern, which is just an example of a way to do a reduction. Sum is a particular example of a reduction. And then in CUDA, which we are targeting with this particular implementation, we use the graph algorithms pattern; that is, we're doing the reduction using a binary tree. So we are traversing this binary tree in a particular way. And the graph algorithm pattern -- it's an actual document. Each of these corresponds to an actual document written about the best practices, with pointers to helpful resources for programming these things. That can be decomposed all the way down to SIMD and collective synchronization on the CUDA architecture. So hopefully you have a better idea of what goes into this, all the computation here. We have the matrix multiplies, sums and element-wise arithmetic. Okay. So for the OpenMP implementation, this was really easy. I did this in like 30 minutes. I took my sequential code, wrote some reduction routines, wrote some data-parallel for loops. Who's written OpenMP here? Raise your hand. Everyone. No. Not everyone. A lot of people. Who's read about OpenMP? Like who's seen this example for OpenMP? Okay. So all you need to do is add one line. And if your work is embarrassingly parallel, it will just divide the work up amongst cores. So you add that, and in combination with this, the reduction version of that: you just define the reduction operator, addition, and the final variable that you're going to be summing all the partial sums into, and each of these threads can compute a partial sum. 
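A minimal sketch of the two OpenMP idioms just described -- illustrative code, not the actual implementation from the talk. An embarrassingly parallel loop needs only the one added pragma line, and a sum uses the reduction clause so each thread accumulates a partial sum that gets combined into the final variable:

#include <omp.h>

// One added line splits the loop iterations across cores.
void elementwise_divide(float* z, const float* x, const float* y, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        z[i] = x[i] / y[i];
}

// Reduction form: "+" is the reduction operator, "total" the final variable;
// each thread computes a partial sum and OpenMP combines them at the end.
float column_sum(const float* x, int n) {
    float total = 0.0f;
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < n; ++i)
        total += x[i];
    return total;
}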
So that's the OpenMP I did for this. And, yeah, it's really impressive, right? So we use Intel's Math Kernel Library for all the matrix multiplies, which uses OpenMP under the hood, so you can control the number of threads there. And we show the performance scaling on a dual-socket Nehalem machine. It goes from 11 and a quarter seconds with one thread down to 2.6 seconds with 14 threads. And we get about a 2 times speedup going from one to two threads. But as we get to 14, we get to the minimum; it's definitely nonlinear scaling. So this probably won't scale well on the same architecture using the same programming model. Though we did get a 7 times speedup compared to the MATLAB implementation, which is down to 2.6 seconds from about 20 -- that's getting pretty good for 20 seconds of audio. All right. So CUDA, I'm not going to go -- who's actually programmed CUDA? Okay. That's good. I'm not going to like get into this. You've probably seen this before. But the idea is you're running lots of threads in CUDA. And threads can be grouped into thread blocks. And within thread blocks you can share memory, and that's how communication is done. Physically, these threads are executed in groups of 32 threads called warps. And if all the threads within a warp all do the same thing, then we get SIMD. So if threads within the same warp are not doing the same thing, this is called a divergent thread. And that means that these threads are going to be executed sequentially, and we lose the SIMD speedup. So down here we see an example of a CUDA kernel which is run on the GPU, and this is just accomplishing element-wise addition. And each thread here just does one addition. So you launch the number of threads, which is the size of your array, and you call this kernel down here using B thread blocks of size N each. That's how you tell the device how many threads to use. And each of these threads operates on a particular element, computed from its thread ID and its thread block ID and everything. So luckily I didn't have to program a matrix multiply in CUDA. That was taken care of for me by one of Jim's students [inaudible]. He came up with this really fast implementation that achieves 60 percent of peak on the GTX280, about 370 gigaflops. I also found that padding my matrices to multiples of 32 worked pretty well; it decreased the running time by about 26 percent. Also the element-wise arithmetic, pretty naive. It was similar to the example on the previous page. The hardest thing to program, though, was the parallel reduction, because this wasn't provided in any libraries or anything, but I did find some good pointers in the CUDA SDK about how to optimize parallel reduction here. So parallel reduction is accomplished in CUDA best using a binary tree traversal. So each thread copies memory to shared memory, local shared memory from global memory. And then we do a series of two-element reductions down until we arrive at the final sum. Now, there's different ways of organizing this, different ways of assigning threads and how they operate on different elements. And the optimal way of doing that is to assign threads so they're adjacent to one another and assign them to have a strided memory access pattern. And you can read about that, if you're interested, in the CUDA programming manual. But the important thing here is that all the sums are adjacent. And then all the threads that aren't working are adjacent as well. 
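A minimal sketch of that reduction pattern, in the style of the CUDA SDK examples rather than the exact kernel used here: each thread block copies its chunk into shared memory and then halves the number of working threads at each step, keeping the active threads contiguous (sequential addressing). It assumes the block size is a power of two.

__global__ void block_sum(const float* in, float* out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;   // copy from global memory into shared memory
    __syncthreads();
    // binary-tree reduction with the active threads kept adjacent
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];        // one partial sum per thread block
}

// launch sketch: block_sum<<<numBlocks, 512, 512 * sizeof(float)>>>(d_in, d_out, n);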
So you're avoiding divergent warps here; in the naive version every other thread is not working, and that's not SIMD. All right. And then we apply these optimizations, like loop unrolling. So reorganizing this tree, as we're moving down, we're increasing the number of optimizations. Reorganizing the tree, we got to here, we apply some loop unrolling. And, last, running 30 sums concurrently. So there are 30 different sums in this Non-negative Matrix Factorization, of length 512. Now, those aren't very long sums. They aren't going to give us much work to parallelize. So if we run them all concurrently, we get all the way down here to about 3 milliseconds from 155 milliseconds. So that was the most important optimization. Yeah. And that's -- that's a lot of work just for a sum. But it paid off. And you can see the results we get: our original optimized MATLAB implementation goes from 18.5 seconds all the way down to .6 seconds. That's a little over a 30x speedup. We're doing 20 seconds of audio in .6 seconds, which is pretty good. And the OpenMP implementation, still, pretty good speedup. 2.6 seconds for 20 seconds of audio. And you can see that the SGEMM, the matrix multiplies, is the main contributor -- the main thing that achieves speedup here. I don't know why MATLAB takes so long to do element divides. That's kind of silly. So yeah. The important thing here: CUDA achieves a lot higher performance; OpenMP, though, a lot less work. So for people who program music applications who aren't expert programmers, CUDA is going to be a bit of a turnoff. When you introduce inter-thread communication through shared memory, CUDA gets a heck of a lot harder. So CUDA would only be feasible for computational kernels that require a lot of performance. And also for audio, there's this question -- we haven't done this yet -- but processing audio in real time, going back and forth to the GPU, are we going to be able to achieve the latency required? And we're going to be releasing Python modules for the music information retrieval community. So to finish up, this fun brainstorm idea: combine music information retrieval analysis with gestural processing or just interaction with computers to come up with like a custom music mix or a custom music performance. So this is applicable to automatic mashups, which is the buzz word now for taking things that you didn't write and putting them together and saying you did. [laughter] >> Eric Battenberg: Gestural music selection, so you can imagine like using that cool touchscreen thing at the spitfire last night, you have that on a wall and you say I don't like this song, and based on all your personal preference data, an audio database in the cloud of music and some music information retrieval, maybe we could navigate a personal space of songs we like. So let's say we're at a party and the song is too loud or it's getting everyone too rowdy and you want to tone it down a bit, you could traverse this space, come up with a point where it fits the mood correctly and just stop there. So maybe seamlessly moving between different songs until you get to a point where you think it fits your mood the best and then just leave it there and let the computer decide what fits that mood best. And you can combine all this music information retrieval with gestural processing, cool interfaces, and just come up with a -- you know, a custom audio mix. 
And this can be used for music performance, if you want to accompany yourself with different instruments pulled apart from different pieces of music and resynthesized later as sort of a mashup, or just for music listening, interactive music listening, where you're controlling your own musical experience rather than just clicking on individual songs. It can be very much personalized, and you can have as little or as much interaction as you want. All right. So to wrap up, there are lots of music applications, both for musicians and for music fans, listeners. And parallel computing enables new applications. That previous example here will definitely be enabled by parallel computing, you know, categorizing tons of audio data, processing these complicated Multi-Touch sensors; that will all be enabled in the future due to parallel computing. But synchronization is really important as well. Because timing is so important to the human ear, we're going to need to put some emphasis on that for realtime music. And parallel design patterns. It's a cool way to organize and communicate your code. And I encourage you to at least look into that as a way to communicate the computational needs that you have. And, last, are there any questions? Yeah. >>: You talked about some of the experimentation with processing being done in MATLAB. Have you ever tried their parallel toolbox? >> Eric Battenberg: I have not. The kernels that I wrote -- the computational kernels that are being called in MATLAB here, matrix multiply, FFT, those kinds of things, are already parallelized in MATLAB, I guess. So when I quoted this number, the number back there, that was with two threads on a laptop, right; MATLAB automatically uses -- I don't know what it uses under the hood, but it's calling BLAS routines that are already parallelized. I haven't used MATLAB's parallel processing toolbox. I don't -- >>: [inaudible] >> Eric Battenberg: Yeah. Okay. [inaudible] BLAS. >>: [inaudible] MATLAB's parallel toolbox and then found it rather frustrating because the way that it works is it spawns a new process. You can't actually spawn a thread. It spawns a new MATLAB process and you're limited to only using four of them. So I have a Nehalem that has eight threads but I can't use them because MATLAB's licenses are [inaudible]. >> Eric Battenberg: [inaudible] license for every core. >>: [inaudible] about enabling domain-specific languages or languages that are more natural to the domain. So I was curious if there's any help at that level of abstraction. It doesn't sound like it, does it. >> Eric Battenberg: No. >>: There's some baby steps, but I don't think it's quite there yet. >> Eric Battenberg: No, not in MATLAB anyway. MATLAB is -- yeah, licensing in MATLAB is too proprietary. I think a lot of people in the music information retrieval community are moving towards Python actually, because it's open source, it's free. And people are writing cool modules. I'm going to be releasing Python modules that you can call from within Python to run CUDA code or run OpenMP versions of this. So yeah. Any more questions? Yeah. >>: Max/MSP is a domain-specific language? >> Eric Battenberg: Yeah. Max/MSP is a visual dataflow language that we're working on parallelizing right now. And a lot of musicians use it. I don't know if it's the best for music information retrieval, but for live audio performance, definitely really good. Yeah. 
>>: So when you're doing the Non-negative Matrix Factorization, do you get away with doing that once per song or do you have to do it about every second [inaudible]? >> Eric Battenberg: You do it every 20 seconds, or it works best when you're working on separate chunks of audio. So 10 or 20 seconds of audio. So for an entire song, you have to do it a few different times. >>: You can't just do it once in the beginning and use the same type of performance for the rest of the song? >> Eric Battenberg: You could, but the reason you want to split it up is because instruments change throughout a song. If the song was completely homogeneous -- the same drums, the same guitar the entire song -- it would be easy to just do it once. But that would be a pretty boring song. >>: All right. Well, let's thank our speaker. [applause] >> Nelson Morgan: Okay. That's a tough act to follow, actually. It's tough to follow music with boring old speech. But hopefully I won't make it too boring. You see the title. And I have an admission to make, which is I don't actually care about parallelizing speech recognition. I care about making it better. But it turns out that there are at least some possibilities, I think, for making it much better by using a lot more computation. And since the way the world seems to be is that to get lots more computation in the future, we need to parallelize, I'm interested. I'm Morgan. Adam Janin is here, and he'll be giving the second part of the talk. Jike Chong couldn't make it. He did a lot of the work that Adam will describe later on in the talk in addition to Adam's own work. And of course we're from Berkeley. We're also from the International Computer Science Institute, a closely affiliated institute a couple blocks off campus. So my fantasy is that we can make speech recognition much better by using a lot more computation and having many streams on many cores. And I'll explain what that means. I have some results from that. And also part of what we use to get this is some neural network components. And Adam will be talking about that along with the work that Jike did on parallelizing a complete unistream system. And unistream is just a one-stream system. And, again, shortly I'll explain what I mean. Oh, in fact, right there I'll explain what I mean. This is just kind of a generic picture of the components of a speech recognizer. Everybody uses something like this. Of course there's lots more detail inside of each box. But basically from the speech people compute some kind of signal processing variables. Those variables are passed to something that estimates the probability of different speech sounds being responsible for these variables that you have locally; that is, say a hundred times a second. Little 20- or 30-millisecond chunks stepped along every 10 milliseconds. These are going to give you probabilities, but the way that they are able to give you probabilities is that there are acoustic parameters which come from models that have been trained on a lot of data offline. And these local probabilities are then fed to another box along with prior information that you have about the nature of the language, such as how probable it is that a particular word is going to follow two or three or more words. And the composition of a word in terms of those speech sounds that you're getting the probabilities for. And at the end you get a stream of words. So that's your basic system. 
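As a rough, hypothetical sketch of that dataflow -- stub functions only, not the SRI or ICSI code -- every 10 milliseconds a short window of samples becomes a feature vector, the trained acoustic model turns it into probabilities of speech sounds, and the decoder folds those into its search over word sequences:

#include <string>
#include <vector>

using Frame = std::vector<float>;

Frame front_end(const std::vector<float>& samples, size_t start) {
    return Frame(39, 0.0f);                   // stub: signal processing features
}

Frame acoustic_model(const Frame& features) {
    return Frame(61, 1.0f / 61);              // stub: probabilities of speech sounds
}

struct Decoder {                              // stub: lexicon + language model search
    void advance(const Frame& phone_probs) {} // extend hypotheses by one frame
    std::string best() const { return "recognized words"; }
};

int main() {
    std::vector<float> samples(160000, 0.0f); // 10 seconds of 16 kHz audio (silence)
    Decoder decoder;
    for (size_t t = 0; t + 400 <= samples.size(); t += 160)  // ~25 ms window, 10 ms step
        decoder.advance(acoustic_model(front_end(samples, t)));
    return 0;                                 // decoder.best() would give the word string
}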
And if you've got lots of data to get all of the training information and your testing situation is roughly the same as your training situation, this actually works reasonably well. So that's sort of what I say in this top bullet here. Speech recognition works pretty well under good conditions with plentiful resources. So if you have lots and lots of training that corresponds to the way you're going to test it and there's not a lot of noise or not a lot of reverberation, you can do pretty well. Depending on the task, you can get down to a few percent. We recently did a Mandarin broadcast news task, and once we got enough data, over a thousand hours of data, we got the broadcast news down to a couple percent. On the other hand, for Mandarin broadcast conversation, even without any particular noise and reverberation, it was more like 10 percent. So you do get much poorer performance for some conditions. And particularly if you get some combination of noise and reverberation and a relaxed conversational kind of speaking style, you can easily get over 30 percent of the words wrong. Yes, sir. >>: [inaudible] percent acceptable? >> Nelson Morgan: Well, it depends on the application. But there's -- human beings generally don't do particularly better than that. But if you actually look at results of, for instance, just recognizing numbers by human beings in a call center, after lunch it's terrible. After coffee, it's pretty good. >>: Even 30 percent is acceptable for some applications. For example, information retrieval, 30 percent doesn't have a whole lot of effect because most of the speech recognition errors are of versus the, in versus on, things that are on the stop list for information retrieval anyway. >> Nelson Morgan: Right. So 2 percent could be not good enough if you were doing something stupid like trying to control a scalpel. >>: Firing missiles. >> Nelson Morgan: Firing missiles. Right. Yeah. Then 2 percent is not good enough. But, on the other hand, as Adam's saying, if you are doing voice search, maybe 30 percent would be good enough. But there's certainly a range of cases where you'd like it to be much better than 30 percent. And you do have those kinds of numbers for a lot of datasets. So what people have already done is made use of several different signal processing methods, not just a single one. In fact, a number of the best systems have two or three different kinds of ways of looking at the speech data, and they incorporate them in different ways, and they do pretty well. But in general people have found once you have two or three of these, if you go to a fourth or fifth or a sixth, you get diminishing returns. So I've been pursuing sort of the counterview, which is maybe there's a way that you could use a huge number of them and actually get better robustness. I'll show here, this is basically the same picture, except we have different signal processing modules. And the reason why I think this could help is that it could be that -- say, if you had thousands of these, maybe you only need to do three or four, but in a different situation, acoustically and in terms of speaking style, et cetera, it might be a different three or four. So where does this fit into parallelism? Well, the front end that I'm imagining is embarrassingly parallel. 
Now, as the front end is now -- that is, what I'm calling the front end is the signal processing chunk -- saying that you're going to parallelize that, anybody who knows this field would sort of say, yeah, why bother, because maybe it's 10 percent of the computation. But I'm talking about expanding it out and making it lots more computation, in which case you really would need to parallelize it in the future in order to get the throughput. And then it would maybe be 90 percent of the computation instead of 10 percent. And I just added this yesterday. I realized that there's a little bit of analogy with the multiview Photosynth stuff that was discussed yesterday morning, that you have many, many different views of the same scene. The difference is that, at least as it was described yesterday, this was in order to give you the possibility of viewing it many different ways, constructing the model, and so forth. And here the major reason for it is that some of these views will be obscured and you want to take advantage of the ones that aren't. So here's a picture of how we would expand out the front end. Right now what people do is compute something like an auditory spectrum. And what you mean by auditory varies. But essentially it's a time-frequency plane with a few auditory properties. One that everybody uses is that there's less resolution at high frequencies than at low frequencies. But you might have some other auditory properties in there. And then you split up the data. So there's a factorization discussed in the last talk. That's one way you could split up the data. The other is just sort of all over the place. Do different Gabor filters, much as is done in vision, with different kinds of scales and ranges and so forth. And then the output of these Gabor filters goes into multilayer perceptrons, and these are trained discriminatively. So they accomplish two things: first, since they are trained discriminatively between different speech sounds, they give you features which are somewhat discriminative; they're pretty good at discriminating between different speech sounds. And the second thing is they output posterior probabilities of these different speech sounds. And the nice thing about posterior probabilities is that there's all sorts of cool ways to combine them. So then the last box -- I'll come back to the second to the last box, but the last box over there is just some transformations that we do to make them nice features for Gaussian mixtures to model. Basically make them roughly orthogonal and roughly Gaussian. So the big questions -- we sort of knew how to do all this. The big questions come in here. Because, again, I think if you just combine in a naive way a thousand different streams or a hundred different streams, it's probably not going to work too well. And so you want to do something that perhaps selects or at the very least strongly de-emphasizes a lot of these streams. But the main point I want to get at here is that if I was trying to do recognition on this speech, I wouldn't want to have to have come in here before and trained up on a thousand hours of other people speaking in this room. You don't have to do that. If you've never been in this room before, you came in here, and you're still able to understand me. Hopefully. 
But that's where we are in speech recognition now, is that we have to collect huge amounts of data to go for a new genre, a new language, a new -- just a new room. And that's a bad situation. So the idea is to have lots of different ways of looking at the speech, and then to be able to switch between them or combine them in some good way. So I think it would take a longer talk to explain why I think this would work, but the practical engineering thing is that we've tried some of these ideas out and at least on a small scale they do work. This is a set of results for numbers recognition. Numbers is not just digits. Numbers includes things like 16 and 60 -- 6, 16, 60 -- which are all pretty similar. Our best results are sort of under 3 percent with a sort of standard system. If we add some of these streams, we get a couple percent, which is getting rid of about a third of the errors. If we add lots more streams -- we clearly don't know how to do that quite right yet, at least for clean data -- it does get a little bit worse. But the noisy case, we made this much tougher. We added all kinds of noises with signal-to-noise ratios varying from zero to 20 dB. And we end up -- and on many different -- it's not white noise, it's many different natural noises. Speech babble, inside of an F16, inside of cars, all sorts of things. And you end up worsening by quite a bit, by a factor of five. And we get rid of about half of the errors. So that works pretty well. And, by the way, we want to point out this is a postdoc of mine, Sherry Zhao, who did this work, and she's currently extending from 28 streams to a hundred streams. So at least for the noisy case, which is kind of what we're trying to get at, since for many applications all of these numbers would have been okay for clean, but probably not so much for the noisy case -- >>: So the back end for all of those is the same, SRI [inaudible]. >> Nelson Morgan: Yeah. This is SRI's -- yeah, we're using SRI's Decipher, which is pretty advanced. >>: So you don't have to change anything in the back end [inaudible]. >> Nelson Morgan: We haven't changed anything, no. Then we went to a larger task. Still, in this case it's not noisy. Some of it's conversational data, which is called broadcast conversation. So it's not that simple. The MFCC baseline is about 25 percent and is about 22 percent if we add a few streams. So we haven't gone up to more streams yet. We haven't added noise yet. This is one of my students, Suman Ravuri, who's doing this. But it does -- again, this is a clean case so the improvement isn't as dramatic. It's about 16 percent. You generally see a reduction in how much cool techniques help when you go from a small task to a large task. But it's still there. And I think if we were looking at a noisy case, it would be much bigger. Okay. So how does this relate to parallelization? We're going to expand the easily parallelizable but computationally dominant parts -- at least in this many-stream image, the feature streams' multilayer perceptrons, which take a lot of computation, because these are, by the way, big multilayer perceptrons. They can be millions of parameters for each one. I should also mention -- although I'm talking here about recognition, one of our big concerns for parallelization is training. Because right now it takes us a couple of weeks to train one of these MLPs. And we'd like that to be an overnight. 
And the Gaussians for the statistical engine, those are all pretty regular and they're pretty straightforward to parallelize, but very important. Then there's parallelizing a full unistream system, and this is Jike Chong's work that Adam will talk about in a moment. But ultimately what we want to do is work with a full application, which we haven't done for Par Lab yet. And of course the reason why we want to do that, take, you know, a full application, is because we want to watch out for [inaudible]. There may be other things where we're just saying, oh, that's not very important, and they could kill us. So we want to ultimately end up with a demoable version of a complete meeting diarizer or browser. This is a picture of one that was done as part of a European Union project where you have time moving along here and you have different speakers, so you're separating out by who's speaking when, and you have a transcription there and then there's other notes that can go off to the right such as summarizations. When the Europeans did their collections, they also did video, and so video can be attached to it. So these things already exist at a research level. But, again, the performance is at this point still not all that great. We'd like to improve that and also be able to handle more ad hoc meetings that might occur in noisy or reverberant environments. Okay. So, again, just to give credit, Sherry Zhao and Suman Ravuri are working on this, trying to make the things better. At ICSI Adam and Chris Oei, who works with him, work on making it faster. And Jike Chong and Youngmin Yi, who are Kurt Keutzer's students, are working on this full unistream system. And that's my part. Oh, yeah, sure. >>: In the work you described, there's one back-end recognizer that's running on multiple streams. >> Nelson Morgan: Yes. >>: Is there a reason for not running multiple back-end recognizers, each on one or a subset of the streams? >> Nelson Morgan: Well, I think the notion is to do the combination as early as possible so that you have as good of information to pass on later as possible. I mean, in a real system, if you were facing -- something I know you love, DARPA kind of deadlines, you might want to partly do some of each. For instance, in SRI's system, they take the two- or three-stream system that we have already and they take another system that's unistream and they cross-adapt, so there's combination here and there's combination there; then they take their system and combine it at the ROVER level with [inaudible] system. And so in a system where you don't care about how much junk you throw in there, you'll probably do all sorts of things. But I think the thing that we're pursuing, not that another method wouldn't work, is a notion that's more like what we think is going on in -- I didn't talk about this -- in [inaudible] auditory cortex, primary auditory cortex, where there appear to be many, many different representations. So the brain does something with that. I don't know what goes on after the primary auditory cortex. >> Adam Janin: Just to add to that, though, the other reason also is that for Par Lab we're more interested in online and real time, whereas for the evaluation systems, it's all batch decoding. And with a batch you can wait until you're sort of at the end of a news show and you have an entire utterance -- an entire hypothesis that you then combine with another system's entire hypothesis. 
It's not clear how you combine at the decoder level if you're trying to do it online, realtime online. >>: [inaudible] so when you expand from one stream to 100 streams, do you expand the auditory spectrum or do you expand the number of Gabor filters there? >> Nelson Morgan: We expand the selection of Gabor filters that go into a particular stream. So one thing that's probably a little wrong about my picture is that each of the streams actually consists of some collection of Gabor filter outputs. And so you can easily expand out to a thousand given the combinations of how wide you make them and how much of the spectrum corresponds to what Gabor filter and so on. But right now 28 to a hundred is all we're doing. But the auditory spectrum, it's just the base that we work from. And then you take chunks of it and run Gabor filters. Typically -- initially we run the same Gabor filter on all parts of the spectrum, but you could change that. >>: I thought in auditory research people typically think about having different aspects of [inaudible] like a rate synchrony, all these -- there's a lot of [inaudible]. >> Nelson Morgan: So I sometimes think of auditory in this context as meaning whatever people haven't implemented yet. Right? In real systems. Right? Because PLP is auditory, but everybody has PLP, so they don't think of it as auditory anymore. So I meant this in just the broadest possible sense that there's some kind of time-frequency plane that has some aspects of auditory ideas in it. [inaudible] in his stuff has tended to put in more than we're doing right now. Right now we're just taking a log mel spectrum more or less. >>: [inaudible] >> Nelson Morgan: But I'm open to the idea that maybe putting more auditory aspects in there might make things better. I have no idea. Okay. Adam. >> Adam Janin: So I'm Adam Janin and I'm going to talk about some other parts of the system. I'm going to start with the multilayer perceptron, or neural network. So the neural network is a machine learning algorithm that just maps some floating point inputs to some floating point outputs. And in our case we're mapping speech features of some sort, you know, for those of you who are familiar with [inaudible] spectral features or PLP features and so on. And we're trying to map that to speech sounds, is it a "ku" or is it a "tu." We do that every 10 milliseconds. Now, in a picture like this, with one neural net, even though our neural nets are quite large, you know, 16,000, 32,000 hidden units, as Morgan said, it doesn't take up that much compute time. So the reason we're interested in parallelizing it is first to handle the sorts of many, many streams that Morgan talked about in his part, but then also during training we train on thousands and thousands of hours, and it can take a month to train a big network. And we'd like to cut that down so that we can run many more experiments and try different types of neural networks and different types of features. So what I'm showing here, just to describe what the computation looks like, is we have an input feature which might be, I don't know, 39 features, each one is connected by a weight, which is just a floating point number, to a hidden layer. And so the operation, the mathematical operation here, is a matrix times a vector. And then you perform a nonlinear operation at the hidden layer and then you repeat to the output layer. 
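In symbols, a sketch of that forward pass in generic notation (not necessarily the exact QuickNet formulation): with input vector x, weight matrices W_1 and W_2, a sigmoid nonlinearity at the hidden layer and a softmax at the output,

h = \sigma(W_1 x + b_1), \qquad y = \mathrm{softmax}(W_2 h + b_2)

so the dominant cost per frame is the two matrix-vector products.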
In a real system it's a little more complex than that because we batch up all these 39 features into one frame, and then we include in that not just the current time but the four future times and the four previous times. And so in this picture here, each one of these is the 39 features and we have 9 frames of them. But it's still a matrix times one long vector. It's just the vector is much longer now, because it's looking at a longer time frame. And we do that because the way you say a speech sound depends on the context it's in. It's not just the current sound; how you say it is not dependent on just the sound you're trying to say, but the sound previous and the sound next. In addition to that, we also, for efficiency reasons, take up to 10 seconds of these and batch them all together into what's called a bunch. And that allows you to do a matrix-matrix operation rather than a matrix-vector operation. And that's of course much more efficient. At runtime the cost of doing the grouping -- the bunching, rather -- is latency. You have to wait until you've gathered them all up. However, that's not really too much of a big deal because you typically have some latency introduced waiting for a word to finish anyway. And so introducing a few seconds of latency at this point isn't a real problem. The bunch sizes change the training in that you're now optimizing for something slightly different than you would if you did it one at a time. But in practice you get the same results either way. It's not very sensitive to that. And the matrix-matrix operation is much more efficient. And so in terms of parallelization, we already have an existing optimized multicore implementation called QuickNet that was developed at ICSI years ago and has been parallelized sort of over time for existing cluster computers. And we used that quite a lot at ICSI, and other places as well have used it. We've also developed a CUDA implementation. And it gets about three times the speed that QuickNet currently gets. I don't have a uniprocessor number on what the speedup is over a single thread, just because we just never run it that way anymore. But I certainly could if people were interested in what that is. Now, once we optimized the matrix-matrix multiply, the computational [inaudible] actually ended up being a combination of assembling the data for the context windows and the nonlinearity. And so we addressed that by doing a fast approximate exponential on the 8-core version, and then fortunately CUDA has a built-in parallelized exponential, and so we use that on CUDA. And so with those optimizations, we're now compute bound for any large net, which is what we typically run. So, you know, as long as you're running a net with thousands of hidden units, we're completely computation bound on these platforms. The next topic is something we're working on currently called a tonotopic network. And it's called tonotopic because we organize the features roughly by tone. So these features down at the bottom are low frequency. It's how much energy is there in the low frequency part of the spectrum, and the higher up ones are how much energy is in the higher parts of the spectrum. And we organize the neural network in a structured way so that we're computing directly on energy in frequency bands. So the reason we think this might work is, for example, imagine you have band-limited noise where at one frequency band there's something going on that corrupts it. 
The next topic is something we're working on currently called a tonotopic network. It's called tonotopic because we organize the features roughly by tone. So these features down at the bottom are low frequency -- how much energy there is in the low-frequency part of the spectrum -- and the higher-up ones are how much energy is in the higher parts of the spectrum. And we organize the neural network in a structured way so that we're computing directly on energy in frequency bands. The reason we think this might work is, for example, imagine you have band-limited noise, where something is going on in one frequency band that corrupts it. The network could learn that and not let it interfere with the clean part of the spectrum. Additionally, we've found that combining things like this with other systems pretty much always helps. Just having something that's a little different from the other systems you have gives you better accuracy. And as you can see, the structure of this is going to be a little different in terms of the computations. First of all, we've added an additional layer, but that's just a dense matrix multiply again. In terms of the input layer, then, there's structure in the input layer -- I'm sorry, I should have stated that. One way of implementing this would be as a block-diagonal matrix, where the entries with no connections are just zeros. But that's not very efficient. So in fact what we end up doing is separate matrix multiplies for each grouping, and that's just a more efficient way of organizing it. And then there's also the context window issue that I described before, which makes the organization of the data here a little more complex. You have to move things around a lot more, and that just takes a little more time. So we already have an existing single-core version that we used at ICSI, written by graduate students who graduated last year, and the multi- and many-core implementations are currently in progress. We expect a similar speedup. It's probably not going to be quite as good, because you have slightly smaller matrices and you're doing several of them, but it should be pretty close.
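Here is a minimal sketch of how that structured input layer might look, with a separate small matrix multiply per frequency-band grouping instead of one big block-diagonal multiply. The number of bands, the per-band sizes, and the sigmoid nonlinearity are all hypothetical choices for illustration:

```python
import numpy as np

def tonotopic_first_layer(x_bands, band_weights):
    """First layer of a tonotopic net (sketch): each frequency-band group of
    inputs gets its own small weight matrix, equivalent to a block-diagonal
    multiply but without the wasted zeros; per-band hidden activations are
    concatenated for the next dense layer."""
    outputs = []
    for x_b, W_b in zip(x_bands, band_weights):
        h_b = 1.0 / (1.0 + np.exp(-(W_b @ x_b)))  # hidden units for this band
        outputs.append(h_b)
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
# Hypothetical setup: 4 frequency bands, each with 9 inputs and 50 hidden units.
x_bands = [rng.standard_normal(9) for _ in range(4)]
band_weights = [rng.standard_normal((50, 9)) * 0.1 for _ in range(4)]
print(tonotopic_first_layer(x_bands, band_weights).shape)  # (200,)
```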
So next I want to go on to talking about the work of Jike Chong and other students at Par Lab on a parallelized decoder. This is the component that, once all this signal processing has been done and you have some features, theoretically does a search over all possible utterances in your language and outputs the one that matches your data the best. Of course, it can't do every possible utterance; that's too many. So it does a huge, pruned search. And in typical single-threaded, single-stream ASR systems this search is the computational bottleneck, so it's useful to parallelize it so that you can get realtime response and just a faster system response. The other components of this -- the pronunciation lexicon, the acoustic parameters, and the language model -- were generously donated by SRI for this project. They've been simplified a little from what we would run as a full system, both to fit in memory, because we have some memory constraints on the GPUs, and also because again we're concentrating more on online and realtime systems. The full-blown SRI system does many, many passes over the data, assumes you have the entire utterance available before you do any processing at all, and does adaptation and all sorts of other things. So for this project we're using simplified systems with fewer parameters and so on; that's just a little different. The work then involved exploring a design space of how to design the decoder, potentially for different architectures. The inference engine itself is a highly hierarchical structure. The outer loop of it is organized time synchronously. This isn't the only way to do it -- other decoders do it differently -- but basically we're moving forward 10 milliseconds at a time. So you're looking at time-synchronous guesses at what the stream might be and updating them all at the same time. That's what the outer loop looks like. The next loop consists of three phases. The first phase is basically the computation of the acoustic models. It's a Gaussian mixture computation that you're doing for all your current hypotheses. And this is completely compute bound and embarrassingly parallel -- well, not embarrassingly; it's a little more complex than that -- but it's reasonably easy to parallelize compared to phases two and three, which involve keeping track of your updates. Again, because you can't search every possible hypothesis, you have to keep track of just the best ones, and to do that the hypotheses have to communicate with each other to keep the updates in order. I'll talk a little bit about that in another slide. The hardware platforms that we ran this on are a conventional multicore platform and also the many-core NVIDIA platform. I think people are probably familiar with these architectures already, but you can see what the sort of performance looks like. And so this slide is sort of the punch line of the best system, and I'm going to go into a little more detail later. It's a breakdown of the time spent in the compute-intensive phase versus the communication-intensive phases. Basically the white bars are phases two and three and the black is phase one. As you can see, the speedup varies as you go from the sequential implementation to the multicore and the many-core. All the phases are sped up, but the communication phases are sped up less than the compute-intensive phase. And so this does indicate that without some additional effort we'll probably start hitting diminishing returns on the communication phases, since we're already up to 50 percent in the many-core case. And then in terms of why the inference engine -- Jike calls it an inference engine, we typically call it a decoder, same thing -- why it's challenging: the decoder runs on a graph called a weighted finite state transducer. The way you should think of this is that it encodes the sort of words you can say and the patterns and their pronunciations. So this might be, for example, op-tics, and this might be F-ics, so it's sharing the suffix. And this might be optometry. And you're proceeding through that as a hypothesis. You can think of a particular hypothesis as just being a spot in the graph that encodes the path it took to get there. The update step is just to move one step forward in the graph, either going to the next speech sound or staying in the speech sound you're in. The graph is not regular, as you can see; it's just guided by what your dictionary looks like. And then where your hypotheses happen to be on the graph depends on what happens at runtime: how you say the words, what the acoustic environment is, and so on. And then of course there's the problem of synchronization between the phases as you update the hypotheses.
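To give a feel for the structure of that inference loop, here is a toy, purely sequential sketch of time-synchronous beam search over a WFST-like graph. The graph representation, the beam width, and the acoustic_score callback are hypothetical simplifications, not Jike Chong's implementation; in the real system phase one is the Gaussian mixture evaluation and phases two and three are parallelized across hypotheses:

```python
def decode(frames, arcs, acoustic_score, beam=200):
    """Toy time-synchronous beam search over a WFST-like graph.

    arcs: dict mapping state -> list of (next_state, label, graph_weight).
    acoustic_score(frame, label): log-likelihood of the frame given the label
    (the compute-bound "phase one"). Traversal and pruning play the role of
    phases two and three.
    """
    hyps = {0: 0.0}                                    # state -> best log score so far
    for frame in frames:                               # outer loop: one step per 10 ms frame
        new_hyps = {}
        for state, score in hyps.items():              # traverse arcs out of active states
            for nxt, label, w in arcs.get(state, []):
                s = score + w + acoustic_score(frame, label)
                if s > new_hyps.get(nxt, float("-inf")):
                    new_hyps[nxt] = s                  # keep only the best path into each state
        # prune: keep the `beam` highest-scoring states as the new hypothesis set
        hyps = dict(sorted(new_hyps.items(), key=lambda kv: -kv[1])[:beam])
    return max(hyps.items(), key=lambda kv: kv[1])     # best final (state, score)

# Tiny usage example with a two-state graph and a dummy acoustic model.
arcs = {0: [(0, "ah", -0.1), (1, "t", -0.5)], 1: [(1, "t", -0.1)]}
score = lambda frame, label: 0.0 if frame == label else -2.0
print(decode(["ah", "ah", "t", "t"], arcs, score))     # best hypothesis ends in state 1
```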
So we investigated a particular design space for the traversal and for the computation, and this slide describes that. The first choice is how you bunch up hypotheses to send to the different cores. You can either organize them so that you're going over arcs or so that you're going over nodes, and of course there are a lot more arcs than there are nodes. So what you would expect is that as you get more SIMD resources, more parallel resources, in all likelihood the arc-based implementation is going to be better than the state-based implementation. But it does depend on how many independent threads you have. The other design axis is the traversal method. This is basically how you communicate and update the current hypotheses when you're moving from one state to the next. You can either do that by propagation, which basically depends on some hardware locking, or you can do it by aggregation into the final states, which in the current implementation is done with some software locks. And so we actually developed four different algorithms and ran them on the different platforms to see which one would work best on each. This slide shows the breakdown of the different phases for the sequential implementation, and then here for each of the other implementations. Basically, along one axis, if you have platform atomics you do better, and along the other axis, as you have more SIMD you do better. On the multicore platform, these are the results we had. As you can see, you definitely want to do propagation rather than aggregation on the multicore setup, but there's not a whole lot of difference between the arc-based and the state-based versions, because you don't have a lot of SIMD lanes. So it's not a huge advantage to do one over the other, and state-based propagation won by a little bit over the other algorithms. The picture is slightly different on the many-core, where the breakdown looks more like this, and as you can see here, the arc-based propagation won by a lot. You're still better off with propagation, because both platforms have good hardware locks, but on the many-core architecture you have a lot more SIMD resources, so there's a much bigger win from using arc-based propagation.
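As a rough illustration of the arc-based propagation idea, here is a sketch in Python where the unit of parallel work is an arc and each worker writes its candidate score straight into the destination state under a per-state lock. The lock stands in for the hardware atomic operations used on the real platforms, and Python threads here only illustrate the structure, not the performance; the data layout and names are hypothetical:

```python
import threading

def propagate_arcs(active_scores, arcs, num_workers=4):
    """Arc-based propagation step (sketch): arcs leaving active states are
    split across workers, and each worker updates the destination state's
    best score directly, guarded by a per-destination lock."""
    work = [(src, dst, w) for (src, dst, w) in arcs if src in active_scores]
    dests = {dst for (_, dst, _) in work}
    new_scores = {d: float("-inf") for d in dests}
    locks = {d: threading.Lock() for d in dests}      # stand-in for atomic max

    def worker(chunk):
        for src, dst, w in chunk:
            cand = active_scores[src] + w             # candidate path score
            with locks[dst]:                          # "propagate" into the target state
                if cand > new_scores[dst]:
                    new_scores[dst] = cand

    chunks = [work[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return new_scores

# Usage: two active states, four arcs; the best score wins at each destination.
arcs = [(0, 2, -0.1), (0, 3, -0.4), (1, 2, -0.3), (1, 3, -0.2)]
print(propagate_arcs({0: -1.0, 1: -0.5}, arcs))       # best scores: 2 -> -0.8, 3 -> -0.7
```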
And I think that's all I want to say about the decoder. Then we just have the summary slide. So as we talked about before, speech recognition works pretty well under some conditions but can still fail quite badly if you have noise, accents, or unexpected conditions of various kinds, such as human-to-human communication. By massively expanding the front end, we hope to address some of those issues: to do better in noise, to be more robust to unknown situations, to do better with less training data. And the Par Lab approach has helped quite a bit in making the individual components much faster, enabling us to do a lot more experimentation as well as to run in real time. And then the final point is how Par Lab tools can enable more experimentation. This gets again at the whole point that right now training one of our big nets can take a month. What we'd like to do is be able to train up a hundred nets, do it in a day, and experiment with many, many more features. Morgan, did you want to add anything else to the summary slide? >> Nelson Morgan: Yeah, I guess the other thing for the last bullet is that right now we're using our own homegrown software, because these things are proceeding in parallel -- the software frameworks, et cetera, are being developed while our work in speech goes on -- and we're not using any of that stuff right now. It'd be nice to see, as that stuff develops further, whether it could be useful for us. >> Adam Janin: I'm talking about the mainstream sort of other ICSI researchers. >>: Do you know whether [inaudible] pare down the SRI decoder's components to enable all the parallel computations done by Jike Chong? >> Adam Janin: Yeah. >>: How much does it reduce the performance as you pare down those components? >> Adam Janin: I could e-mail that to you; I don't have it off the top of my head. It was, I believe, around 4 or 5 percent. >>: [inaudible] harder to parallelize. >> Adam Janin: Well, again, most of the reduction had to do with just fitting it in memory on the NVIDIA card. We wanted to have a fair comparison between the NVIDIA card and the multistream system, and so we had to take, for example, language models that were much more heavily pruned than we normally would. So the language model is just smaller than it would be for our full-grown system. >>: So which parts of the application do you really want to run locally on your handheld as opposed to the cloud? It seems like the training might be done on the cloud. >> Eric Battenberg: Yeah. Absolutely. So the training -- almost certainly you do all your training in a cloud. You might do some adaptation locally. And then in terms of runtime, ICSI has participated a little bit in some of the distributed speech recognition tasks, the idea being that you do some of the computation on your phone, basically all the front-end stuff, and then ship that off digitally to a back end, rather than either shipping audio from your phone, which introduces its own set of artifacts, or trying to do all the recognition on your phone. But I think as the processing power gets greater and greater, you'll see more and more of it migrating all the way down to your phone. For some tasks you can do that already. I think most of you probably have cell phones that will do voice dialing already. And that works perfectly fine because it's right against your mouth, it's in a quiet environment, and the vocabulary is small. But if you put the same thing in the center of a table and try to record a meeting, you wouldn't get half the words right with the current technology. >>: Let's thank our speaker and take a break. [applause]