>> Sumit Basu: It's my great pleasure today to introduce Roger Dannenberg from CMU. Most of you here probably already know that he's a very big figure in computer music. What you may not know is that he's worked in a huge variety of aspects of computer music throughout his time there. Before I even get to that -- so he's an associate professor at CMU in computer science and also has an appointment in the art department. So let's start with the programming language stuff. First he worked on programming languages for music and for sound, and this brought some of the first functional programming language ideas into sound, and this is stuff which is still used today. He's worked on automatic teaching systems for music, like the Piano Tutor system I believe it was called, which would take a year's worth of music instruction and compress it into something like 20 minutes of time. >> Roger B. Dannenberg: 20 hours. >> Sumit Basu: 20 hours. 20 minutes would be great. But 20 hours is still pretty good. And that was very revolutionary. More recently he's done a lot of work on music analysis, looking at structure analysis as well as trying to recognize genres and styles of music. In addition to all of that and being a professor at CMU, he's also managed to be a world class performer. He plays the trumpet, and he's played with a lot of amazing people, has performed at the Apollo Theater in Harlem and all kinds of other places. Just an amazing guy, and I'm really, really happy to have him here for the day to give this talk. >> Roger B. Dannenberg: All right. [applause]. Well, thanks very much. It's great to be here. So what I'd like to talk about today is a little bit about work that I've done in music understanding, and I'll really try to focus on an application area that I'm moving into. The main point of this is that I think music is for everyone. Millions of people are practicing musicians, so I'm not talking about just listening to music but actually performing music. And this is something that a startling number of people do at some level of proficiency, and many, many more people would like to do it. So I think that computing is a way to make music performance more fun, more available, higher quality, and it's a big area to explore, and I spend a lot of my time doing that and thinking about that. So what are the problems that we can solve in this area? Well, one thing, and a very big thing, is that practicing is a lonely thing for most people. You don't really want to practice with other people. You know, there's the funny line that when a musician makes a mistake, the guy next to him, if he wants to make a joke, says hey, practice at home. But also, musical partners are not always available, so if you want to play music in an ensemble and you're not doing that daily on a professional basis, it can be hard to get people together and make sure all the parts get covered. Another issue for amateurs is that while 100 years ago all music was live and the quality of music was not necessarily all that great by today's standards, now we have recordings of the best symphonies, the best solos. We almost don't hear anything but virtuosos playing. And so if you're an amateur, you've got a much higher bar setting the standard for music. And so we might think about ways that we can help amateurs achieve the quality of sound that they imagine they would like to produce. 
So what if we had computers that could play with us, editors that could fix our mistakes, and new forms of personal expression? That's the direction I want to go. And I think an important way to get there is this area, this research that I call music understanding, which is the recognition of pattern and structure in music. So that's intentionally a very broad kind of definition, almost all encompassing. And so what do we mean by pattern and structure? Well, I really mean structure at many different levels. So we have what we might call surface structure in music, which is things like pitch and harmony, loudness, identifying notes. These are very specific, not very abstract, concepts. And then there is much deeper structure in music, such as the relationship between phrases, the association between printed music and music audio, and emotion in music -- both understanding what emotion the music is trying to express and also understanding how to express an emotion through a musical performance. And that's just part of expressive performance, which includes not only emotion but other kinds of musical issues. Trying to get computers to deal with all of these aspects of music is my biggest interest and really the main goal of music understanding. So there are some tasks in music understanding. This is not all of them, but generally I would say in music understanding we work on problems related to matching musical sequences, and by musical sequence I mean both music notation or something like a MIDI file -- symbolic performance information -- and also audio information. So we have problems of matching symbolic scores to audio, and problems of matching audio to audio. For example, given a song, find covers of it -- find other artists who have recorded the same song but in a different style. Those are some different kinds of music sequence matching problems. There's also searching for music. So query by humming systems, where you hum a tune and look for it in a database -- that's another music sequence problem, a music recognition problem. Okay. Another set of problems has to do with parsing music, and that includes classification -- understanding genre or emotion or identifying instruments. It includes segmentation and structure. So for example, find where the beats are, find where the notes are, find the chorus of a pop song that gets repeated several times -- find that section of music that tends to repeat. I put the word parsing in quotes, but generally there's a whole set of problems associated with that type of thing. And then finally there's expressive performance, which I mentioned before. Expressive performance really has to do with the gap between music notation -- where, at least in Western notation, things are quantized to beats and an absolutely rock steady metronome tempo, so if you feed it into a synthesizer you get something that sounds like an early cell phone performance, not very musical or very interesting -- and what a human musical performer can do to make the music expressive. There's this big gap between that literal rendering of note information and an expressive human performance. 
So I'd like to start by showing this video, and probably at least a couple of you have seen this, and I apologize for showing it over and over again, but this is some of the first work that I did in the field, and I think it's still such a good example of what can be done that even though I could show you a more modern version, I think it's more impressive to see that this stuff actually ran in 1985, and it was kind of an applied music understanding system. So let's switch to -- whoops. Okay. Here we go. This is a demonstration of a computer accompaniment system. I'm going to begin by loading a file into the program. The file contains a score, and in the score there's a part that I'm going to play on trumpet that's also displayed on the screen. There is another part, which is the accompaniment part, in this case composed quite some time ago. The purpose of the program then is to listen to my performance, a live performance of the solo part, and to synchronize the other parts and play them along with me. [music played]. The computer accompaniment system is based on a very robust pattern matcher that compares the notes of the live performance as I play them to the notes that are stored in the score. To illustrate how well the pattern matching works, I'm going to deliberately create a nightmare for the accompanist by playing lots of wrong notes, changing the tempo, and missing some notes. [music played]. Okay. So just a quick word about how that works and what's going on. The trumpet performance is going into this input processing box, which really is doing pitch recognition, and the computer, because I loaded up this score, has a sequence of what I'm supposed to be playing. And those get matched together using a modification of the longest common subsequence string matching dynamic programming algorithm, modified to deliver results incrementally in realtime. Then the accompaniment part is also stored with the score. And I think the most interesting thing about this structure is that the matching process and the accompaniment performance process are loosely coupled processes. So the information from matching is constantly going over into the accompaniment performer, but the performer doesn't play exactly what is matched, because it's trying to give a musical performance. The system is based on the fact that an accompanist is really another musician, so the accompanist has an idea what the tempo is and how to perform things musically, and so these are kind of loosely coupled processes. And of course the output from that just drives a music synthesizer, and that's what you're hearing. Okay. So that video and that whole accompaniment system are based on the idea that music is played with expressive timing, and the way that you should accompany a performer is to listen carefully to the notes that they play and to their timing and adjust to synchronize with that. So if you speed up a phrase or slow down a phrase, then the accompaniment is going to follow that. That is a bit of a simplification, but it's more or less the way we make Western classical chamber music. And so this system is good for that. And this has been commercialized. There are over 100,000 students using practice systems, able to practice at home doing this kind of music. 
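As a rough illustration of that matching idea -- not the actual 1985 system -- here is a minimal sketch in Python of an incremental, longest-common-subsequence style matcher, assuming the score is just a list of MIDI pitch numbers and a pitch detector delivers one performed note at a time. The real system's windowing, skip penalties, and timing heuristics are omitted.

    # Sketch of an incremental LCS-style score matcher (illustrative only).
    class IncrementalMatcher:
        def __init__(self, score_pitches):
            self.score = score_pitches
            # best[i] = best match count using score positions 0..i-1 so far
            self.best = [0] * (len(score_pitches) + 1)

        def note_on(self, performed_pitch):
            """Fold one new performed note into the dynamic programming
            column and return the most plausible current score position."""
            prev = self.best[:]  # DP column before this note arrived
            for i in range(1, len(self.best)):
                match = prev[i - 1] + (1 if self.score[i - 1] == performed_pitch else 0)
                self.best[i] = max(match, prev[i], self.best[i - 1])
            # Report the earliest score position achieving the best match count.
            return self.best.index(max(self.best))

    # Example: follow a five-note score while the player hits one wrong note.
    matcher = IncrementalMatcher([60, 62, 64, 65, 67])
    for pitch in [60, 62, 61, 64]:  # 61 is a mistake
        print(pitch, "-> score position", matcher.note_on(pitch))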
And that's all great, but long after doing this, I was playing with a rock band and getting a little bit frustrated. There were some notes that, you know, at the end of the night I'd be really tired, and they were hard to hit, and the band played so loud it was hard for me to come up to that volume. And I was thinking, you know, every once in a while it would be really great if I just had some samples of my playing from when I'm feeling really good. And if I could just cue them and get them into the sound system -- the band is so loud nobody would know that it's not coming acoustically out of my trumpet, it could just come out of the speakers. And that would be really great. And then if you could do that, then I'd say why stop there, you know, instead of two trumpets and a trombone and a sax we could have three trumpets and two trombones and a couple of saxes. >>: Milli Vanilli. >> Roger B. Dannenberg: Yeah. Well, except that we'd be in control, and we'd be performing. But yeah, it's close to Milli Vanilli. And so my second thought was hey, wait a minute, I've already done this, I know how to build accompaniment systems, this is just an accompaniment problem. And I thought yeah, I should do that. And then the third thought that I had was, wait a minute, if I really want to play effectively with a rock band, the really important thing is timing, and also, to do the kind of music that we're playing, the accompaniment system can't really follow another trumpet player, because we sit out for 16 measures or more and then suddenly we just come in on an entrance. And the whole band doesn't necessarily follow a strict score from beginning to end, because there might be some improvisation, there might be mistakes you have to deal with. And also, if you didn't want to follow a trumpet, maybe you want to follow a guitar player or something. But the guitar player is playing off of chords. He's not playing individual notes that we can actually recognize. He's not doing that consistently from one performance to the next, because he might just say okay, I'll voice this chord differently, or I'll use a different strumming pattern tonight, and who knows what could happen. And so the more I thought about it, the more I realized this is a completely different problem. But it's a big problem. I mean, it's big in the sense of lots of subproblems, lots of twists, and also lots and lots of applications. So I call this problem performing popular music with computers. And I don't really like the word popular, but I use popular because that seems to be the best word that characterizes a very wide class of music where tempo is very steady. So I really could call it performing steady tempo music with computers, but somehow that seems a little awkward. And I mean not just the pop genre -- it could be rock, pop, techno, jazz -- most jazz is in this category, most folk music is in this category. So it's actually a very, very wide class of music. The goal is to create more musical opportunities for people, and ultimately new artistic directions for musicians. And when I say performing popular music, the performing part implies that this is for live performance by humans and computers. And the popular part, as I explained, means steady beat. 
So the big research question is how can we coordinate human and computer performers of popular music? And I have an example to show you, another video. This was sort of the first implementation of a system that we took out for a test drive in April of this year with the Carnegie Mellon jazz ensemble, and we set this up as an initial problem. Without even having implemented the system, the first thing I did was go to a really wonderful arranger, John Wilson, who is the guy conducting here, and ask him if he would write a big piece for jazz ensemble and a 20 piece string orchestra. He actually gave me the number 20. I said how many strings do you need to do something that would just sound great? And he said give me 20. And I said okay, 20. Not really knowing how I was going to do this, that's what we decided. So he went off and did the arranging and I went off and started writing software. And we talked a lot about what we were going to recognize and what he could and could not do in the arrangement. But this is what we came up with. [music played]. Okay. I'm going to pause for just a second to explain what's going on in case it's not clear. We do have one string player who's live. That's Dave Pello [phonetic], who is actually the director of the ensemble, which is about to come in. And all the other strings that you hear are up here. These are eight studio monitors. They sound incredible. And what they're playing -- well, I'll tell you more about how we did this after you listen to a little bit more. But the high string violin section stuff that you hear is all coming from a computer. And the rest of the band is piano, drums, bass, saxes, trombones and trumpets. [music played]. Okay, so this goes on for a long time. I'm going to jump ahead. So he's improvising. Yeah. I hate to -- there's also a great tenor improviser in the middle who plays for a while. So the strings are in the background. [music played]. Okay. And I'm going to jump to the end, I think, if I can find it. [music played]. Whoa. That was a little too much. [music played]. [applause]. Okay. So -- yeah? >>: I'm finding it hard to appreciate how much the system is doing without knowing kind of the rules of engagement on both sides. Like -- >> Roger B. Dannenberg: Yeah. >>: Who's doing the improvisation, who is sort of [inaudible] music? >> Roger B. Dannenberg: Yeah. So that's -- I want to tell you about that. And I'll tell you in detail what the system is doing. >>: [inaudible] until the end of this, then almost at the end of the solo with the strings coming in. >> Roger B. Dannenberg: Yeah. Yeah. >>: So is that part of the score, or is it waiting for some specific note for [inaudible]. >> Roger B. Dannenberg: Well, part of the point of showing you some different parts of this video was to illustrate some of the problems. Even though it's mostly steady tempo, there are things like, at the very beginning, the strings were sort of following a bit of a rubato performance by the bass, and then at the very end things kind of came to a halt and then the conductor cued in the string section, which plays this very fast thing. So yeah, that's in the score. But the score says that there's going to be a little cadenza by the bass and he's going to hold and then they come in on tempo. So that's part of it. >>: There was one part in the very beginning that looked like an awesome thing. 
I don't know if it was really happening, but it was when the cellist was silent, and at the exact moment that the cellist made his first sound the strings came in -- it looked like, you know, within a subsecond of the same time. I didn't know how that was synchronized. >> Roger B. Dannenberg: Yeah. Well, so I'll play the beginning again so you can hear that. [music played]. Yeah, like right there. [music played]. Well, so he's actually conducting. So what's happening is -- in this case we're actually tapping in time, and I'll give you some more details, but the sensing is not by audio, the sensing is from a much more kind of discrete foot tapping and manual cuing sort of thing, which makes the problem a lot simpler. >>: [inaudible]. >> Roger B. Dannenberg: Yeah, yeah. And so -- well, yeah, we're using image capture, but it's through two human eyes that are watching the conductor. And the bass player is also watching the conductor. And even that -- well, so let's get back to the -- oh, yeah, good. I have a picture up here. So this is the implementation for that April concert. There's a foot pedal, and based on foot tapping we're doing tempo estimation and forward prediction of where the next beat is going to be. We also have a little keyboard which just provides some extra inputs that are used for cues -- I'll talk about why we do this later. But anyway, the cues and the beat prediction are integrated to find a score position, and the score position drives variable rate playback of audio. And for the variable rate playback, we have again 20 channels -- 20 strings, each string on an independent audio channel, because that enables you to do very high quality time stretching. The time stretching works by labeling each period of each instrument. So we have something like eight megabytes of just pitch labels. And if you want to stretch a string out, let's say you want to stretch by 10 percent, you take every 10th cycle, every 10th vibration of the string, and copy it, copy that 10th one, so now you have 11 vibrations, and if you do the copying very carefully, you can do that with no glitches and it's essentially artifact free -- works really well. And then similarly, if you want to scale by 10 percent in the other direction, you can just leave out every 10th vibration and then you play faster. 
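Here is a toy sketch of that period-level stretching idea, assuming the pitch labels (the sample index of each period) have already been computed as described; a real PSOLA implementation would also window and cross-fade at the splice points rather than butt-splice whole periods.

    import numpy as np

    def stretch_by_periods(audio, period_starts, ratio):
        """Toy period-level time stretch: repeat whole pitch periods to
        stretch (ratio > 1) or drop them to compress (ratio < 1).
        `period_starts` are precomputed sample indices of each period."""
        out, acc = [], 0.0
        for k in range(len(period_starts) - 1):
            period = audio[period_starts[k]:period_starts[k + 1]]
            acc += ratio - 1.0
            if acc <= -1.0:      # compressing: leave this period out
                acc += 1.0
                continue
            out.append(period)
            if acc >= 1.0:       # stretching: play this period twice
                out.append(period)
                acc -= 1.0
        return np.concatenate(out)

    # Example: stretch a 100 Hz tone (441-sample periods at 44.1 kHz) by 10%.
    sr = 44100
    tone = np.sin(2 * np.pi * 100 * np.arange(sr) / sr)
    starts = np.arange(0, len(tone) + 1, sr // 100)
    longer = stretch_by_periods(tone, starts, 1.10)
    print(len(tone), len(longer))  # output is roughly 10 percent longer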
>>: [inaudible]. >> Roger B. Dannenberg: No. One of the band members was. >>: Who is also playing another instrument at the time? >> Roger B. Dannenberg: That was the original plan. And everything was built to do that. And it turned out that the arrangement did not call for vibraphone, and the band has a vibes player who is also an excellent percussionist, and so she ended up doing the pedal so she would have something to do. >>: Is this kind of like the system watching the conductor? Is that what the foot pedal is -- >> Roger B. Dannenberg: Yeah, yeah. And so she's listening to the band, she's watching the conductor. She also insisted that she be able to have visual contact with the drummer's high hat. The high hat is the little cymbal, also called a sock cymbal, that goes up and down. And the drummer for this piece would normally play that on the beat, so it was helpful to see that. >>: Just getting a semantic interpretation that a solo is about to end, isn't that -- >> Roger B. Dannenberg: Yeah, yeah, so some of the cues are coming from the conductor, and some of it's just that she can listen and she knows what's happening. But there were a number of string entrances, maybe 10 different entrances during the piece, and I intentionally made it so that they had to be cued; they would not happen automatically. And the main idea there was if the band ever just really messed up and fell apart and got back together again, or if she messed up, then I or someone could kill the strings for that entrance, you know, reset things and then bring it back up, and it would make the next cue. You know, in live performance people miss their cues, so -- usually it's not the whole string section at once that drops out [laughter], and that would be bad, but the audience would probably never know the difference. That's what we always say when those things happen. Okay. So another interesting thing about the playback is there are a lot of systems that have been implemented to do this kind of time stretching. Usually it's time stretching by a fixed amount over some interval of time. If you use Pro Tools or some other editor, you select something and say stretch this by approximately 10 percent. What we're doing is actually doing the time stretching with interactive control in realtime. And one of the things that happens there is that because the stretching works on periods of vibration, you can only cause time to jump ahead or lag by these discrete intervals. It's not really a continuous process. And so if you have 20 strings and they're all playing different pitches and you're constantly kind of ramping them up and down in tempo to follow the band, then all these quantization errors are going to start piling up and the whole orchestra would potentially diverge. I mean, it's kind of a statistical process. So what I had to implement was -- so this PSOLA is pitch synchronous overlap add. That's the process for reassembling this audio. And we have a buffer in front of the time stretching that is being filled from the 20 channels of audio into these 20 buffers. And by monitoring whether each buffer is getting full or getting empty, we can modulate the amount of time stretching that we're telling the pitch synchronous overlap add to do. So it's kind of a servomechanism, so that they all track each other and we don't accumulate any rounding error. So all that stuff's going on in realtime during the performance. Yeah? >>: [inaudible] an actual 20 string section performing that piece at reasonable -- >> Roger B. Dannenberg: Yes. >>: Not generic [inaudible] but somebody performing that -- >> Roger B. Dannenberg: Right, right. It's actually recordings. We did the recordings by first -- we actually did put down a click track, and then we had a rhythm section play with the click track -- it's actually the same rhythm section -- so that they would give the kind of syncopated feel that the piece called for. And then we multi-tracked it -- we had four or five violin, viola and cello players come in and multi-track the piece. So we had at most five players, but we ended up with 20 tracks. >>: [inaudible]. >> Roger B. Dannenberg: Yeah. Each one on a separate track -- with our studio, we were able to record three at a time in isolation booths, so it's all totally isolated. Okay. So that's an example. 
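As a rough sketch of that servomechanism idea, with made-up names and constants: I'm assuming a fuller buffer means that channel has consumed less of its source audio and is lagging the shared score position, so it should be asked to stretch a bit less (consume faster), and vice versa.

    def servo_stretch_ratio(target_ratio, buffer_fill, setpoint=0.5, gain=0.2):
        """Return the stretch ratio to request from one channel's PSOLA unit.
        `buffer_fill` is the fraction (0..1) of that channel's buffer that is
        currently occupied. A fuller buffer means the channel is behind the
        shared score position, so stretch less; an emptier buffer means
        stretch more. The constants here are illustrative only."""
        error = setpoint - buffer_fill
        return target_ratio * (1.0 + gain * error)

    # Example: the band-wide ratio is 1.05, but one violin channel is lagging.
    print(servo_stretch_ratio(1.05, buffer_fill=0.85))  # asks for a bit less stretch
    print(servo_stretch_ratio(1.05, buffer_fill=0.15))  # asks for a bit more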
What I want to do now is talk about the way I'm thinking about this problem. I see what we did in April as just a first, you know, prototype implementation to learn about the problems. And now I'm really thinking about other ways to solve this, better ways to solve this, how to generalize it. And I think the way to think about it is to analyze this type of performance, this popular music performance, and break it down into its elements, and once we have a more abstract way of looking at performance, we can think about how to get the computer to do this. So I have this thing I'm calling the performance model, which really has two parts. There's a listening part and there's the performance, the playing part. On the listening part, here's what you have to do to play music with other people. First of all, you have to know generally where you are in the music. So, you know, what piece are we playing, are we playing letter A or letter B, is this the verse or the chorus, that type of thing. You also have to know the beat and the tempo. Because again, we don't synchronize by synchronizing notes, we synchronize by having some abstract notion of where we are in the piece, and the primary structure there is really the beat. And the fact that the beat is steady allows musicians to achieve a lot of synchronization even though they might all be improvising. So beat and tempo are really critical. The second thing is, and this wasn't obvious to me at first, but it's really important to know where 1 is. What I mean by that is most music is in 4-4. Other music is in 3, or 5, or 7. But generally if you turn on the radio you'll hear something in 4, which means the structure is 1, 2, 3, 4, 1, 2, 3, 4, and so when musicians say, you know, where is 1, they're talking about the downbeats of these measures. This other piece -- da, da, da, da, 3, 4, da, da -- those are all on 1. The purpose of finding 1 in music -- again, something I never even thought about before thinking about this problem -- the reason that we have measures and we think about where 1 is, and that it's useful, is that beats go by really fast, you know. If we didn't have measures, it would be 1, 1, 1, 1, 1, 1, 1, 1. And if you're going at that rate and you try to nod to your fellow musician, okay, let's come in now, you're going to have 1, 1, 1 -- which 1 were you talking about? It goes by way too fast. I'll tell you another story. I just performed The Sorcerer's Apprentice with a community orchestra last weekend. And that piece is actually written -- so that's the bum, bum, ba, ba, ba, bum, ba, bum, ba, bum. It's in 3. It goes 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3. So this is 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. And for the trumpet player, you look at the score, and you play a little bit and then you're out for 50, 60 measures. 1, 1, 1. Counting for dear life. And after the first rehearsal I said I just can't do this, I can't count 1 when 1 goes that fast, and it turns out the piece really is structured into groups of three measures. I think the notation is just wrong. But you know, who am I to tell Paul Dukas how to orchestrate? 
But I actually had to notate groups of three in my music, and that enabled me cognitively to deal with the stuff going by. So 1 is really crucial. If you're going to communicate and synchronize with your fellow players, you do it on the basis of measures. That's the point. So you have to know where 1 is. And you also have to listen for, or sense, what other players are telling you. The other player gives you a nod or the conductor gives you a downbeat or whatever -- somehow those cues are critical. And the reason cues are critical is that in this kind of music, people do improvise a lot. The vocalist might vamp for a couple of extra bars, somebody might forget to come in. You might have a transition and the rhythm section forgets how many measures it is and they add some extra measures. And so people do, in live performance, look at each other, gesture to each other, yell at each other -- whatever it takes to stay together. And while all this perception stuff is going on, we're also performing music. So that means you have to do tempo regulation and phase regulation, which means that just because I know where the beats are, that's not all there is to playing. If I'm a bass player, I might play a little bit ahead of the beat. If I'm a drummer, I might feel like I'm playing exactly on the beat. If I'm playing trumpet with this jazz band that I play with, it's crucial that the horns actually play a little behind the beat. This is a big band, and it's just the style of the band that we play behind the beat. And it makes it sound hip. It sounds cool. It's jazz. And sometimes guys will come and sit in with the band, more classically trained guys, and they'll play right on the beat. They're great musicians and they play every note right and they're on the beat, and we never invite them back [laughter] because it just doesn't swing. So phase is really critical. And it's not very well understood and not very well studied, so there's a whole set of problems there. And then there's dynamics, how loud you play; and expression, which is kind of everything else. So let's talk about some of these problems. The first one is knowing where you are in the music. How do we do that? One technique that I think shows a lot of promise is audio alignment. I've done a lot of work in this area, and other people have too. This is an example I pulled from some classical music audio alignment. This is Beethoven's Fifth Symphony, first movement, and the horizontal axis is audio rendered from a MIDI file and the vertical axis I think is the Philadelphia Orchestra, but it's a human orchestra playing. And using a kind of forced alignment dynamic programming and some features called chromagrams -- I can go into details, but they're basically spectral features -- we can find the best alignment of these sequences of features that we extract and actually match up two performances of the piece. This is a screenshot from Audacity that I'm doing some work on, to integrate these algorithms so that we can feed in a MIDI file -- this happens to be a Haydn symphony -- along with a recording, click align, and the system will automatically match up the symbolic representation of the music with the audio. 
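For a flavor of how such an offline alignment can be put together, here is a small sketch using chroma features and dynamic time warping; the specific pipeline -- librosa's chroma and DTW routines, the hop size, the placeholder file names -- is my own assumption for illustration, not the code behind that screenshot.

    import librosa

    # Load a MIDI-rendered reference and a real recording of the same piece.
    ref, sr = librosa.load("midi_rendering.wav", sr=22050)
    perf, _ = librosa.load("orchestra_recording.wav", sr=22050)

    hop = 2048  # roughly 0.09 seconds per feature frame at 22.05 kHz

    # Chroma ("chromagram") features: 12 bins of spectral energy per pitch class.
    ref_chroma = librosa.feature.chroma_stft(y=ref, sr=sr, hop_length=hop)
    perf_chroma = librosa.feature.chroma_stft(y=perf, sr=sr, hop_length=hop)

    # Dynamic programming (DTW) finds the lowest-cost alignment path between
    # the two feature sequences, i.e. a map from score time to audio time.
    dist_matrix, path = librosa.sequence.dtw(X=ref_chroma, Y=perf_chroma, metric="cosine")

    for ref_frame, perf_frame in path[::-1][::200]:  # print every 200th pair
        print(f"score {ref_frame * hop / sr:7.2f}s  <->  audio {perf_frame * hop / sr:7.2f}s")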
And so for doing live performance, the idea is knowing where you are -- you know, is this a verse or a chorus? I think we can take existing recordings of the band, you know, maybe made in a rehearsal, and quickly mark them up with, okay, this is section A, this is section B, or this is a verse and this is a chorus, and do some kind of realtime version of this matching. But there's a whole field of research here for following popular music performances in realtime and giving feedback to musicians about where they are. So that's one issue I hope to work on. Knowing the tempo is the next problem. The first thought you might have is, well, we'll just use audio analysis and find the beats. And people that have worked in that area know that we've made a lot of advances, but it's still really hard. And if you want to do this reliably enough that you trust your computer to get up on stage with you and play and always know where the beat is, I find that pretty frightening. So that's why I've resorted so far to this foot pedal thing. But after doing this concert in April and after looking at the high hat and seeing how consistent it was, I think there's a lot of room for combining multiple sensors -- even accelerometers on cymbals and drums, drum sensors, and beat tracking algorithms -- and if we put all of that stuff together with a lot of sensor fusion sorts of technologies, maybe we can build something that's really, really reliable. So that's a new direction to pursue. >>: [inaudible]. >> Roger B. Dannenberg: Conductor. Yeah. There -- >>: You can see that there's that movement. >>: Or even if he has some sensors in his hands. >> Roger B. Dannenberg: Yeah. I've actually captured a lot of conducting gestures with wrist worn accelerometers, and other people have used other kinds of sensors. And so far the conclusion I'm reaching is that conducting might be an interesting source of data, but it's not enough, and it's not very reliable. In fact, I've also done interviews with conductors where they conducted and we shot video and then we asked them to review the video and tell us, what were you doing, what were you thinking, and at least in the classical world conductors will tell us, you know, I'm not trying to indicate tempo or anything here, I'm trying to fix this problem with the dynamics of the piano player. And then the conductor will say, okay, there I got his attention and we fixed that, and I went back to -- yeah, it's very surprising. I think the common assumption is the conductor is up there to tell everyone where 1 is and to tell them what the beat is. I think that's not at all what actually happens. >>: [inaudible] farther back away from the popular music direction you're talking about -- >> Roger B. Dannenberg: Yes, that's right. And as you know, popular music has no conductors. >>: [inaudible] as well as the non-conducted performances. Imagine, particularly in a fusion world, watching the movements of the violin bows, all of them, or watching the movements of the hands of the guitar players, all of them, and being able to say okay, I've got a pretty good idea, based on 35 sensors or 350 sensors or 1,000 sensors, that I know where we are, and, you know, I want to follow the guitar player now. >> Roger B. Dannenberg: Yeah. Yeah. And that's exactly the kind of approach that I think we need. And so I'm very interested in pursuing that. 
Maybe not initially with 300 sensors, but, yeah, I think that's exactly the right idea -- the beat kind of moves around, I mean, where the good place to find the beat is, is fluid, and so we need not only algorithms to look for the beat but also ways to figure out which data is reliable and which data is not, and so this is the whole fusion problem. >>: [inaudible]. >> Roger B. Dannenberg: Yeah. Yeah. Okay. So foot tapping, sensors, a lot of room for machine learning in here. Let's go on to the next topic, which is where is 1. In the music information retrieval world, finding downbeats has not really been a problem that people have identified. But I think there are some techniques, for example finding out where chord changes are happening, looking again at player gestures, and then, you know, what we did -- the reliable, simple but kind of distracting way of giving cues is to just press a key or some kind of out of band signal, which for a percussionist is no problem. But you kind of have to have a free hand, and you have to be thinking about it. The next thing is communicating cues with other players, and the problem here is not how do you know where you are but how do you know that you know where you are? And musicians do this all the time, in very subtle ways sometimes. For example, when I'm playing with other trumpet players in a band, we're typically playing different notes but the same rhythms, and so just standing next to someone who's breathing, inhaling, getting ready to play a note at the same time you are, you have this kind of sixth sense that you must be doing the right thing because you're breathing at the same time that they are. And it's something that you don't really think of consciously, except when they have an entrance that's one beat after yours, and you breathe and they don't because they're doing the right thing. You really have to either trust yourself and go with it, or you can get faked out, or you just miss it in rehearsal and next time you make a note and remember that this is your solo and you don't have to synchronize with somebody else. Those kinds of interactions I think are very hard to reproduce with computers, and I'm not sure how to do that. But I think it's going to take some combination of visual displays. We've talked about tactile displays, like, you know, something that would tap you or shock you or buzz you. And there have been some instances of people using, I guess, little vibrators attached to performers so performers could signal each other without making sound, which I think is a really interesting idea. So there's kind of a whole research area here of this kind of cuing during musical performances. And then we get to performing. Now that we've thought about listening and sensing, what about playing? Part of this is tempo, and one of the problems is, given prior beat times, how do you know the current tempo and how do you estimate the time of the next beat? Initially I thought, well, we're dealing with steady tempos, so this is not really a problem. And it turns out that as I've been measuring the tempo of live performances, I find a lot more variation than I expected. I mean, even a jazz ensemble playing with a drummer keeping time will fluctuate tempo up and down 10 percent based on whether the soloist is getting excited. Or who knows what's going on, you know? 
I don't really know, but I see the data and I see tempo drifting a lot. And so this graph is an example of just saying, let's take the N previous beats and do a linear regression and call that the tempo and use it to estimate where the next beat is. And let's figure out, for different sizes of N, how good the prediction is. That's what this graph shows. If we only look at the two previous beats to guess the next one, our prediction error is kind of high, because these are not true beats but foot taps, and there's jitter and noise in the tapping. If we smooth over some more beats, like five to ten beats, then we hit a minimum here where the prediction is pretty good. And then at some point as N gets large, we're actually smoothing in data from a time when the tempo was different, and so we're using the wrong tempo to estimate the next beat. And as this number gets large, the error gets up to 45 milliseconds of typical error -- well, actually I think that's the standard deviation. But anyway, we're getting large errors if the window gets too large. And of course the window size depends on the characteristics of the players and, you know, how steady things really are. >>: Is this [inaudible] different genre, is it something specific to -- >> Roger B. Dannenberg: Well, this general U shape I think is common, but the minimum will move to the right as the music becomes more and more steady. So it really depends on the players. Probably depends on the genre, too. >>: Seems to be two to three downbeats, right? I mean, if you get to the two beat thing you're basically not even capturing, you know, two measures. So it seems like that would be outside -- >> Roger B. Dannenberg: Yeah, well, actually, you know, this curve looks like the minimum is at six. There's some other data we looked at where that number got a little bit larger. But, yeah, I think the five to ten range is typical. Yes? >>: Does your synthesis system have a delay in it? Do you have to sort of initiate the player early? >> Roger B. Dannenberg: Yeah, you do. And that's why we can't really wait for a foot tap or anything, we want to predict it. And actually, it also turns out that if you look at past data, again with popular steady tempo music, you can predict the time of the next beat more accurately than someone can actually tap on that beat. So when they're tapping, they're getting typically around 30 milliseconds standard deviation around the true beat, which is unknown, but we have some interesting ways I could tell you about later for how we know that it's 30 milliseconds when we don't really know where the beat is. But we do know that. And so using this data, we can shrink that down to, you know, more like 20 or 25 milliseconds. 
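A small sketch of the regression behind that graph, assuming all we have is a list of recent tap times in seconds: fit a line to the last N taps (beat index versus time), read the tempo off the slope, and extrapolate one beat ahead. Choosing N is exactly the window-size trade-off in the plot.

    import numpy as np

    def predict_next_beat(tap_times, n=6):
        """Predict the time of the next beat from recent foot-tap times
        (seconds) by linear regression over the last n taps.
        Returns (predicted_time, seconds_per_beat)."""
        recent = np.asarray(tap_times[-n:])
        idx = np.arange(len(recent))                   # beat numbers 0..n-1
        slope, intercept = np.polyfit(idx, recent, 1)  # slope = seconds per beat
        return slope * len(recent) + intercept, slope

    # Example: taps near 120 BPM (0.5 s per beat) with a little jitter.
    taps = [0.00, 0.51, 0.99, 1.50, 2.02, 2.49, 3.01]
    when, period = predict_next_beat(taps, n=6)
    print(f"next beat ~{when:.2f}s, tempo ~{60.0 / period:.0f} BPM")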
Yeah? >>: [inaudible] there were a couple of string entrances from rubato sections. Did your vibraphone/foot pedal player learn to anticipate the audio lag, or did we just not notice? Do you know what I mean? If the vibraphone player signals -- taps the pedal or presses the key right on cue -- we'd expect to hear some audio lag unless that person had kind of gotten good at it and -- >> Roger B. Dannenberg: Oh, yeah. >>: And 25 milliseconds or whatever it is that -- >> Roger B. Dannenberg: So what we did in this piece was the strings never come in immediately when the cue is given. The cue is always given ahead of time. Sometimes that had to be done on -- >>: [inaudible] tempo beats in between the cue [inaudible] the string entrances? >> Roger B. Dannenberg: Right. >>: Okay. >> Roger B. Dannenberg: Right. And sometimes things were basically hard wired so that there wasn't actually a separate cue, it was just tapping, which was the scariest part, but it really wasn't such a problem because the tapping interface is so reliable. But, for example, the conductor would go 1, 2, 1, 2, 3, 4, and when the conductor did 1, 2, that enabled the tapper to get the tempo in her head. And then when he goes 1, 2, 3, 4, she starts tapping. And I think she would only tap two beats there and then the strings would come in. >>: [inaudible] that you placed on the [inaudible] they shouldn't just be wild entrances that didn't have time for, say, a count before them? >> Roger B. Dannenberg: Yeah. He actually -- you know, we talked about all this, and he went off and did what he wanted, so [laughter] the only concession we really had to make is that between the slow section and the fast section he wanted to just go something like, you know, 2, 3, 4, boom, and have the band come right in, and I said no, you can't do that, because the strings have to be there. So you've got to give me two measures of count between the slow section and the fast section. But otherwise all this kind of pseudo-rubato stuff, we agreed that we weren't going to do that. And, you know, it actually wasn't that much rubato, it was just the way it was written, it kind of sounds rubato. >>: [inaudible]. >> Roger B. Dannenberg: Yeah? >>: On the previous slide, when you were kind of adjusting, do you have some mechanism where you were kind of catching up if you were making a mistake, or -- I mean, like calibrating the window size live based on -- >> Roger B. Dannenberg: No, and that's something we've tried looking at. For example, there's a formalism called a switching Kalman filter that tries to look at whether your model is fitting the data, and when it's not you could switch to another model. We haven't had luck with the switching models yet, but I think that's possible. And I think that human players are doing something like that -- when the assumption is steady beat and somebody needs to change the tempo, it's very common in a performance that somebody does something, they go like let's speed it up, or, you know, the band leader will say something to the drummer. And so that's kind of an out of band signal that's telling everybody, okay, reset your tempo estimation. And that seems to be necessary for humans. I'm not quite sure how to do that with machines, but that's a problem. Okay. Let me just -- I think I talked about this a little bit. Oh, yeah, and I want to get to -- okay. So the point here is there are lots of interesting problems. And we could go through lots more of them. I want to just mention some other work with this example. I talked about this problem that professional, even virtuoso, recordings are things that amateurs have to contend with. Everyone hears them now. And so one thing I think we can do for amateurs especially, but that is also very useful for professionals, is use computers to fix up their recordings according to scores or other information to make them sound better. People do this by hand all the time, but it's extremely tedious. 
And so I've started looking at whether we could automate a lot of this process. This is a trumpet trio that I performed, and I intentionally played it without a click track. There were some long gaps where you had to just count and then come in and hope you were right. And I knew it would fail. And so this generated kind of what in the business we call a train wreck. So I'll play that, and then I'll play the editor output. [music played]. Okay. So this is a picture of some of the waveforms. In the score, this note and this note and this note should actually all be aligned with one another, and they weren't. So I took the machine readable version of the score, which, you know, could come from just a MIDI file, and the audio, all as separate tracks, and I fed them into this little prototype editor that tries to look for timing discrepancies and fix them up. It also looks at intonation, and if something is generally sharp or flat, it fixes that. And it also looks at dynamics and tries to make everything balanced. And so here's the output. There's still definitely some problems, and if I had realized what was going to happen in advance, I could have made a different mode in this editor to handle this too. But I wanted to be able to claim that I just ran this stuff through the editor with no hand correction, and this is what you get, so here we go. [music played]. So better intonation, better synchronization, better balance. I mean, it's really remarkable to think you could just take that initial recording and in a totally automated process come up with this. So my vision for amateurs and even professionals in the future is something like spelling correction, where you put the audio up on the screen and a little window pops up and says you came in early here, do you want me to fix it? And you say yeah. And you just go through the score in minutes and make something that sounds like it was recorded by, you know, top studio players or something. 
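This is not the actual editor, but as a toy sketch of just the timing-correction part of that idea, suppose each track's note onsets have already been detected and matched to score onsets; then the notes can simply be shifted toward their score times (a real editor would need local time stretching and crossfades, and intonation and dynamics correction would be separate passes).

    import numpy as np

    def nudge_onsets(track, detected_onsets, score_onsets, sr, strength=1.0):
        """Toy timing correction: shift each detected note so its onset moves
        toward the corresponding score onset. `detected_onsets` and
        `score_onsets` are matched lists of times in seconds; strength=1.0
        snaps fully to the score. Overlaps simply overwrite in this sketch."""
        out = np.zeros_like(track)
        ends = list(detected_onsets[1:]) + [len(track) / sr]  # note boundaries
        for det, end, target in zip(detected_onsets, ends, score_onsets):
            start = int(det * sr)
            stop = int(end * sr)
            shift = int((target - det) * strength * sr)
            new_start = max(0, start + shift)
            new_stop = min(len(out), new_start + (stop - start))
            out[new_start:new_stop] = track[start:start + (new_stop - new_start)]
        return out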
Okay. So let me just wrap up and say that performers do two things. They gather information -- about location and tempo, about where 1 is -- and they get cues from other players. And performers play music, and there are problems of intonation and dynamics and synchronization. Each one of these is an interesting problem from the signal processing perspective, from the machine learning perspective, from computer science and from music. And if we can do all of these things and put them together, then we'll have a system, or kind of a whole new product category, that millions of people will find interesting uses for. So I've shown you a little bit of the work that we've done to get started, there are lots of open questions for the future, and that's what I wanted to share with you. So thanks for your questions and attention. [applause]. >> Roger B. Dannenberg: Yes? >>: When you introduced the idea of moving into the popular music domain, you hinted that the problem is score following for pop music where there isn't really a score -- like following the guitar, a sequence of guitar chords that are loosely interpreted. But in the work you actually showed, mostly there was a score. There was a fairly rigid score where there might be improvisation injected, but there was a score. Have you actually moved in that direction, toward following the loosely scored pop structure? I mean following guitar chord changes in certain pop settings? >> Roger B. Dannenberg: Yeah. So one thing I can tell you is this score alignment stuff that we've done, we've tried on rock tunes, like Beatles tunes, for example. And what we found is that we would align MIDI files with actual, you know, Beatles recordings, audio. And very often the MIDI files were not really transcriptions of what they actually did. They would sort of be almost like a cover tune but rendered in MIDI. And in spite of that, we found that we could do alignment. So that would be cases where the vocal in the tune would be one melodic line and somebody sort of did an arrangement for MIDI and they harmonized it, so they were actually injecting major melody notes on non-vocal instruments. There would be more phrasing and improvisation in the audio vocal line than there would be in the notated score. And in spite of all that, we can actually do alignment. Although when we do that, we're doing forced alignment, not incremental score following. And so it still remains a question of, you know, what's the latency going to be when you do the alignment? Yeah? >>: Yeah. Have you thought about ways you can have the computer give feedback back to the musicians? Because it seems like you're just looking one way with what you're doing. And I know there are some bands that will have, like, a drummer play with a click track live, and it seems like if you could combine both directions you might get some interesting results. >> Roger B. Dannenberg: Yeah. Yeah. Well, this was sort of a little prototype just imagining what an interface might look like. And so part of that is the 1, 2, 3, 4 would blink on a display so the computer could feed back where it thinks it is, and the computer could be updating measure numbers. The idea is that as a musician -- I mean, we know from some experiments that we've done and some evaluation of some interfaces that if you ask a musician to read a score and you put a display right next to it updating a lot of stuff, pretty much the musicians can't think about both at the same time. And even if they know the music and they're just looking at the display, they'll look at maybe the 1, 2, 3, 4 and they'll never, ever see the measure number. And so it's just very hard in realtime to think about music and keep track of multiple things. And that's why I think it's such an interesting research problem. You do the obvious stuff and -- >>: [inaudible] those random cues that happen two or three times in a song. >> Roger B. Dannenberg: Yeah. But I think what's actually going to happen is that you need some kind of display that maybe doesn't work the first time but which is something that musicians can learn to use, and if you present the right information and they know the information's there and they know how to access it, then when they're playing and they want that information, they will be able to just glance at it -- the same way that if I'm in an orchestra and I've counted 30 measures but I'm not quite sure if I'm right or not, I can turn to the guy next to me and go like this, and that means are we at measure 32? And he'll either do this back to me in some subtle way, or he'll realize that I'm wrong and then we start talking to each other, you know, because then it's clear that something is about to break down. And so anyway -- yeah? >>: [inaudible]. >> Roger B. Dannenberg: Yes? 
>>: [inaudible] question, but the [inaudible] that is this famous MRI center for imaging of the brain. >> Roger B. Dannenberg: Yeah. >>: Have you done experiments where you have a [inaudible] or playing music and seeing the brain images of the [inaudible]. >> Roger B. Dannenberg: I have not. I know that there's been a lot of research on music performance and MRIs. It's just not an area where I've done any work. >>: [inaudible] because the MRI center in Pittsburgh is [inaudible] yeah. >> Roger B. Dannenberg: Yeah. And I know that they do a lot of realtime processing that I don't know has been picked up everywhere else yet, but I've thought about it -- you know, it's one of those things where I think, this is incredible, there ought to be some interesting question I could ask and get a great answer to. It's almost like, sometimes you see something so incredible you want to find a problem to solve with it. But I just haven't done that. Yes? >>: [inaudible] if you've thought at all about a maybe not quite so formal setting but still a group setting, so like say you're jamming, but it's almost like a performance -- somewhere between, like, you know, the full open jazz thing and where you're still kind of coming up with pieces. The notion is you still have parts that you might want to trigger and maybe loop, but you're coming up with them on the fly as you're doing them. So it's like you're composing at random, it's kind of [inaudible] often get together, kind of like, that's a good groove, I want to build on it, and stuff like that. I mean, have you thought at all about supporting that scenario? >> Roger B. Dannenberg: I think that's -- so the question is about less structured performance situations and how, you know, computers could support that, and I don't really have any good ideas there. Although I've personally composed a number of pieces, not in this kind of popular music or steady tempo genre, but pieces for interactive computer and performer, so I see a tremendous potential there. I'm just not quite sure, you know, how to tap into it. I think it's going to be a really interesting opportunity if we solve some of these more standard problems; then, when we have information about where 1 is and what the tempo is, and we have good communication and cuing mechanisms, I think we might be able to drop in some computer music generation modules and ways of communicating with them. Yeah? >>: I wonder what you think about video games like Guitar Hero and Rock Band, whether they're good for getting people interested in performance and learning about music? >> Roger B. Dannenberg: Yeah. I think Guitar Hero and Rock Band are incredible. You know, I'm not sure how they're doing now, but in 2007 they nearly outsold all music downloads -- they were close. They were about a billion dollar business each. So that's a serious music distribution channel. And one of the most interesting things about that is that there's a whole generation of young people interested in music now that are not only getting some performance experience but are learning about, you know, let's say the Beatles, or early classic rock that a lot of us probably thought was just dead, and now it's very popular, because that's kind of the genre or the collection of songs that ship with some of these products. 
So yeah, and I think that the feeling of interacting with other players and of mastering an instrument -- even though, I mean, pushing on four buttons and watching that visual score has a lot of differences, I think there's a lot in common. And I think that's why it's so popular: it's natural to want to play music and to be involved in the production of music, and that really taps into that. So yeah, I think it's a great thing. At some point in the future, I think people are going to figure out how to take advantage of these now millions of young people that have this bizarre music notation reading ability [laughter] -- they read, you know, four notes scrolling up the screen, but the virtuosity of good players is just amazing. So, you know, clearly they've internalized a way of looking in advance and planning muscle movement and all this on this kind of realtime display. And that alone to me is fascinating. And so I could imagine future concerts where somehow the audience is really involved in producing music and the performers will display scores using the notation of the masses, which is like Rock Band or Dance Dance Revolution. Yeah? >>: Are you comfortable in improving the performance of amateurs, so it's like [inaudible] all those things. Is it possible to [inaudible] an amateur sound like a pro? I mean, you have some samples of [inaudible] and then [inaudible]. I mean, just replacing as if [inaudible]. >> Roger B. Dannenberg: Yeah. So how can we make amateurs sound like pros or have that sound? I think that's definitely in the realm of possibility and something to look for. Personally, I think that in a lot of cases amateurs really have the ability to make the sounds they want. And they want their sounds to be in their recordings. What's lacking is the consistency. You know, I was talking to the conductor of the orchestra at Carnegie Mellon. They do some very adventuresome, very difficult pieces and perform regularly in Carnegie Hall in New York. And he told me that the real difference between a good college orchestra and a good professional orchestra is that a professional orchestra will rehearse an entire concert in a few days and prepare it and perform it, and with a college orchestra you would spend maybe a few months to do a really advanced performance. He says it takes longer, you have to work, and you have to learn the parts, but the end result is the same. And sometimes the college orchestras play better than professionals, especially on very, very difficult pieces that you can't just sight read and throw together in a few hours of rehearsal -- you have to really break it down and work things out. And that's just another side of this: the consistency and editing, taking all the good takes and putting them together, might be more valuable than replacing notes with professionally recorded samples. But I do think that's possible, and I do think that's a really interesting thing to pursue. Yeah? >>: [inaudible] given only a few days to rehearse a piece? >> Roger B. Dannenberg: Typically, yeah. >>: How often do they change their music? >> Roger B. Dannenberg: Well, in Pittsburgh, the Pittsburgh Symphony during the season plays a new concert every week, and every once in a while they have a week off. But they really -- I think they get Monday off, and they come in Tuesday. 
So they basically have Tuesday, Wednesday, and Thursday to prepare, and then there are concerts Thursday, Friday, Saturday and a Sunday matinee. And I've been to rehearsals and seen that when the Pittsburgh Symphony plays a classical symphony like Mozart, it's common for the conductor to just play spots and say okay, let's play the beginning; there might be some tricky section where you say okay, everybody skip to this and we'll play this. But they won't even play the piece through once before the Thursday night opening. But these musicians are unbelievable musicians. You put anything in front of them and they play what's there. And of course they're practicing and preparing the stuff at home. So when they walk in on the first day, they've had the music for months, and they get paid big bucks to show up and know their stuff. So if anyone actually missed a note -- okay, I'll tell you another story that just came to mind. A Carnegie Mellon student horn player subbed for the symphony, and I heard that in the performance one of the other horn players missed a note, and the conductor, I'm not sure who it was, stopped the orchestra and looked at the student and said, you're not in college, and then he went on. And it wasn't even her that missed the note. Nobody said a word. But, you know, that's the level of musicianship that's expected -- nobody's going to miss a note, so all they really have to do is work on balance and articulation and expression, and the orchestra has to get a sense of how the conductor's going to conduct. So the professional world is a really different world. And even in Hollywood with film score reading, it's very much the same thing, only those guys usually don't have the parts in advance. They come in and they sight read, and they're expected to play everything down without missing a note. And I've talked to some of those players. Even a major motion picture -- the theme from Rocky, that thing with all the trumpet players -- one of my teachers played fourth trumpet on that, and he said the thing in the movie was the fourth take, and nobody had seen the music before, you know, that day when they recorded it. But they're phenomenal musicians, and they make a lot of money because they can come in and be perfect all the time. Yeah? >>: [inaudible] you're doing like [inaudible] as well as the spectral dimension. So are you [inaudible] that knowledge of the instrument [inaudible] because I [inaudible] certain sections in orchestra music where the violins are accompanying all the wind instruments and they [inaudible]. In your example, in your video, you had a string [inaudible] almost like one single -- >> Roger B. Dannenberg: Yeah. >>: But quite often the strings are broken up and the [inaudible]. So are you also kind of like accompanying an instrument in [inaudible] to know where you are or to know this is the right [inaudible]. >> Roger B. Dannenberg: We are not doing that. We're not coupling instrument recognition to anything else. Not because it's not a good idea, it's just that no one knows how to do it. I guess it's almost an unstated assumption in all of this music understanding work that we don't really know how to do source separation, and so everything we hear is kind of a composite of all the sounds at once, and we do what we can with that. 
Or we go out and put separate microphones on each instrument and try to do source separation by [inaudible] -- I mean, we just try to keep it separate rather than merging it together. But, you know, I think there are some possibilities there, but so far people are getting better results and achieving more practical things by looking at pitch and trying to strip out all the timbral differences. So whether it's a violin or a trumpet playing, the important thing is what pitches are they playing. >> Sumit Basu: Time for one more question. >> Roger B. Dannenberg: Yeah? >>: Perhaps a silly question. The strings which are being played by the computer, those are prerecorded sequences, it's not realtime? >> Roger B. Dannenberg: That's right. So there's no improvisation there. All of their parts were notated and written out. And so essentially what we're doing is trying to cue them in and keep them synchronized. >>: Thank you. >> Roger B. Dannenberg: Thank you. >> Sumit Basu: So let's thank Roger again. [applause]