
>> Sumit Basu: It's my great pleasure today to introduce Roger Dannenberg
from CMU. Most of you here probably already know that he's a very big figure in
computer music. What you may not know is that he's worked in a huge variety of
aspects of computer music throughout his time there.
Before I even get to that -- so he's an associate professor at CMU in computer science and also has an appointment in the art department. But he has worked in a lot of areas -- so let's start with the programming language stuff.
So first he worked on programming languages for music and for sounds, and this
has brought some of the first functional programming language ideas into
sounds, and this is stuff which is still used today. He's worked -- he's worked on
automatic teaching systems for music like the piano tutor system I believe it was
called, which would take a year's worth of music instruction and compress it into
something like 20 minutes of time.
>> Roger B. Dannenberg: 20 hours.
>> Sumit Basu: 20 hours. 20 minutes would be great. But 20 hours is still
pretty good. And which was very revolutionary. And more recently he's done a
lot of work on music analysis, looking at structure analysis as well as trying to
recognize genres and styles of music.
In addition to all of that and being a professor at CMU, he's also managed to
be a world class performer. He plays the trumpet, and he's played with a lot of
amazing people, has performed at the Apollo Theater in Harlem and all kinds of
other places. Just an amazing guy, and I'm really, really happy to have him here
for the day and give this talk.
>> Roger B. Dannenberg: All right.
[applause].
Well, thanks very much. It's great to be here. So what I'd like to talk about today
is work -- a little bit about work that I've done in music understanding and really
try to focus today on an application area that I'm moving into and so the main
point of this is I think music is for everyone.
Millions of people are practicing musicians, so I'm not talking about just listening to music but actually performing music. And this is something that just a startling number of
people do at some level of proficiency and many, many more people would like
to do that. And so I think that computing is a way to make music performance
more fun, more available, higher quality, and it's a big area to explore, and I
spent a lot of my time doing that and thinking about that.
So what are the problems that we can solve in this area? Well, one thing, and a
very big thing is just practicing is a lonely thing for most people. You don't really
want to practice with other people. You know, there's the funny line that when a musician makes a mistake, the guy next to him, if he wants to make a joke, says hey, practice at home. And so -- but also musical partners are not
always available, so if you want to play music in an ensemble and you're not
doing that daily on a professional basis, it can be hard to get people together and
make sure all the parts get covered.
Another issue for amateurs is that while 100 years ago all music was live and the
quality of music was not necessarily all that great by today's standards, now we
have recordings of the best symphonies, the best solos. We almost don't hear
anything but a virtuoso playing. And so if you're an amateur, you've got a much
higher bar setting the standard for music. And so we might think about ways that
we can help amateurs achieve the quality of sound that they -- you know, that
they imagine that they would like to do.
So what if we had computers that could play with us, editors that could fix our
mistakes and new forms of personal expression? And so this is the direction I
want to go.
And I think an important way to get there is this area, this research that I call
music understanding which is the recognition of pattern and structure in music.
So that's a very -- intentionally a very broad kind of definition, all encompassing
almost. And so what do we mean by pattern and structure?
Well, I really mean structure in many different levels. So we have what we might
call surface structure in music which are things like pitch and harmony, loudness,
identifying notes. All right. So these are very, very specific not very abstract
concepts. And then there is much deeper structure in music such as the
relationship between phrases, the association between printed music and music
audio, emotion in music, both understanding what emotion the music is trying to express and also understanding how to express an emotion through a musical
performance.
And that's just part of expressive performance, which includes not only emotion
but other kinds of musical issues. And so trying to get computers to deal with all
of these aspects of music is my biggest interest and really the main goal of music
understanding.
So there are some tasks in music understanding. This is not all of them but just
generally I would say in music understanding we work on problems related to
matching musical sequences and so by music sequence I mean both music
notation or something like a MIDI file, symbolic and performance information and
also audio information. So we have problems of matching symbolic scores to
audio, we have problems of matching audio to audio. For example, find covers
of a -- given a song, find other artists who have recorded the same song but in a
different style. So those are some different kinds of music sequence matching
problems.
And also searching for music. So query by humming systems where you hum a
tune and look for it in a database. That's another music sequence problem,
music recognition problem.
Okay. So another set of problems have to do with parsing music and that
includes classification, understanding either genre or emotion or identifying
instruments. It includes segmentation and structure. So for example find where
the beats are, find where the notes are, find the chorus of a pop song that gets repeated several times -- so find that section of music that tends to repeat.
And so those are all -- I mean, I put the word parsing in quotes but generally
there's a whole set of problems associated with that type of thing. And then
finally there's expressive performance which I mentioned before and expressive
performance really has to do with the gap between taking a piece of music
notation, where, at least in western notation, things are quantized to beats and, you know, an absolute rock steady metronome tempo, and you feed this
into a synthesizer and you get something that sounds like an early cell phone
performance. And it's not very musical or very interesting.
And so there's this big gap between what -- between that just literal rendering of
note information and what a human musical performer can do to make the music
expressive.
So I'd like to start by showing this -- by showing this video and it probably at
least a couple of you have seen this, and I apologize for showing it over and over
again, but this is work that I -- some of the first work that I did in the field, and I think it's still such a good example of what could be done that even though I could show you a more modern version, I think it's more impressive to see here's
stuff that actually ran in 1985 and it was kind of an applied music understanding
system. So let's switch to -- whoops. Okay. Here we go.
This is a demonstration of a computer accompaniment system. I'm going to
begin by loading a file into the program. The file contains a score, and in the
score there's a part that I'm going to play on trumpet that's also displayed on the
screen. There is another part which is accompaniment part in this case
composed quite some time ago. The purpose of the program then is to listen to
my performance, a live performance of the solo part and to synchronize the other
parts and play them along with me.
[music played].
The computer accompaniment system is based on a very robust pattern matcher
that compares the notes of the live performance as I play them to the notes that
are stored in the score.
To illustrate how well the pattern matching works, I'm going to deliberately create
a nightmare for the accompanist by playing lots of wrong notes, changing the
tempo, and missing some notes.
[music played].
Okay. So just a quick word about how that works and what's going on. The trumpet performance is going into this input processing box which really is
doing pitch recognition, and the computer, because I loaded up this score, has a
sequence of what I'm supposed to be playing. And those get matched together
using a -- kind of a modification of longest common subsequence string matching
dynamic programming algorithm modified to deliver results incrementally in
realtime.
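(A rough illustration of that matching idea, for readers following along: this is a minimal longest-common-subsequence style matcher updated one note at a time, not the actual 1985 algorithm, which used a windowed, penalty-based variant. The class and names below are made up.)

```python
# Minimal sketch of the score-matching idea, not the actual 1985 system.
# The score is a list of expected pitches (MIDI note numbers); each time the
# pitch detector reports a new performed note we update one column of a
# longest-common-subsequence table and report an estimated score position.

class ScoreFollower:
    def __init__(self, score_pitches):
        self.score = score_pitches
        # best[i] = most score notes matched using the first i score notes
        self.best = [0] * (len(score_pitches) + 1)

    def note_on(self, pitch):
        """Process one performed note; return an estimated score index."""
        prev = self.best[:]                      # column for the previous note
        for i in range(1, len(self.score) + 1):
            match = prev[i - 1] + (1 if self.score[i - 1] == pitch else 0)
            self.best[i] = max(match, prev[i], self.best[i - 1])
        # Earliest score position that already accounts for every match so far.
        return self.best.index(max(self.best))

# Example: follow a short score as notes arrive, with a wrong note and a skip.
follower = ScoreFollower([60, 62, 64, 65, 67])
for played in [60, 61, 64, 67]:
    print(follower.note_on(played))
```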
And the -- then the accompaniment part is also stored with the score. And I
think the most interesting thing about this structure is that this matching process
and the accompaniment performance process are loosely coupled processes.
So the information from the matching is constantly going over into the
accompaniment performer, but the performer doesn't play exactly what is
matched because it's trying to do a musical performance. And it knows or the
system is based on the fact that an accompanist is really another musician so
that accompanist has an idea what the tempo is and how to perform things
musically and so these are kind of loosely coupled performances.
And of course the output from that just drives a music synthesizer and so that's
what you're hearing.
Okay. So this work -- that video and that whole accompaniment system -- is based on the idea that music is played with expressive
timing and the way that you should accompany a performer is to listen carefully
to the notes that they play and listen carefully to their timing and adjust to
comply, to synchronize with that. So if you speed up a phrase and slow down a
phrase, then the accompaniment is going to follow that. And that is a bit of a
simplification, but it's more or less the way we make western classical chamber
music. And so this system is good for that. And lots of people -- this has been
commercialized. There's over 100,000 students using practice systems, able to
practice at home doing this kind of music.
And that's all great, but long after doing this, I was playing in a -- playing with a
rock band and just getting a little bit frustrated that I couldn't -- there were some
notes that I was just, you know, at the end of the night I'd be really tired, and they
were hard to hit and the band played so loud it was hard for me to come up to
that volume. And I was just thinking, you know, every once in a while, it would be
really great if I just had some samples of my playing when I'm feeling really good.
And if I could just cue them and get them into the sound system, the band is so
loud nobody would know that it's not coming acoustically out of my trumpet, it
could just come out of the speakers. And that would be really great.
And then if you could do that, then I'd say why stop there, you know, instead of
two trumpets and a trombone and a sax we could have three trumpets and two
trombones and a couple of saxes.
>>: Milli Vanilli.
>> Roger B. Dannenberg: Yeah. Well, except that we'd be in control, and we'd
be performing. But it's -- yeah, it's close to Milli Vanilli. And so my second
thought was hey, wait a minute, I've already done this. I know how to build
accompaniment systems, this is just an accompaniment problem. And I thought
yeah, I should do that. And then so the third thing -- thought that I had was, wait
a minute, if I really want to play effectively with a rock band, the really important
thing is timing, and the -- also to do the kind of music that we're playing, the
trumpet -- the accompaniment system can't really follow another trumpet player
because we sit out for 16 measures or more and then suddenly we just come in
on this entrance.
And the entrances are -- well, the whole band doesn't necessarily follow a strict
score from beginning to end, because there might be some improvisation, there
might be mistakes you have to deal with. And also, even following -- if you didn't
want to follow a trumpet, maybe you want to follow a guitar player or something.
But the guitar player is playing off of chords. He's not playing individual notes
that we can actually recognize. He's not doing that consistently from one
performance to the next because he might just say okay, I'll voice this chord
differently or I'll use a different strumming pattern tonight, and who knows what
could happen.
And so the more I thought about it, the more I realized this is a completely
different problem. And but it's a big problem. I mean, it's big in the sense of lots
of subproblems, lots of twists and also lots and lots of applications.
So I call this problem the -- well, performing popular music with computers. And
I don't really like the word popular but I use popular because that seems to be the
best word that characterizes a very wide class of music where tempo is very
steady. So I really could call it performing steady tempo music with computers,
but somehow that seems a little awkward.
But I mean not the pop genre but I mean rock -- could be rock, pop, techno, jazz
-- most jazz is in this category, most folk music is in this category. So it's actually
a very, very wide class of music.
So the goal is to create more musical opportunities for people, to create
ultimately new artistic directions for musicians. And when I say performing
popular music, the performing part implies that we're going to do live -- this is for
live performance of humans and computers. And the popular part I explained
means steady beat.
So the big research question is how can we coordinate human and computer
performers of popular music? And I have an example to show you, another
video. So this is work -- this was sort of the first implementation of a system that
we took out for a test drive in April of this year with the Carnegie Mellon jazz
ensemble, and we set this up as an initial problem without even implementing the
system. The first thing I did is I went to a really wonderful arranger, John Wilson,
who is the guy conducting here, and asked him if he would write a big piece for
jazz ensemble and a 20 piece string orchestra. Or the 20 -- he actually gave me
the number 20. I said how many strings do you need to do something that would
just sound great? And he said give me 20. And I said okay, 20. Not really
knowing how I was going to do this, that's what we decided.
So he went off and did arranging and I went off and started writing software.
And we talked a lot about how we were going to recognize and what he could and could not do in the arrangement. But this is what we came up with.
[music played].
Okay. I'm going to pause for just a second to explain what's going on in case it's
not clear. So we do have one string player who's live. That's Dave Pello
[phonetic], who is actually the director of the ensemble which is about to come in.
And all the other strings that you hear are up here. These are eight studio
monitors. They sound incredible. And what they're playing are -- well, I'll tell you
more about how we did this after you listen to a little bit more. But the high,
especially the high string violin section stuff that you hear is all coming from a
computer. And the rest of the band is piano, drums, bass, saxes, trombones and trumpets.
[music played].
Okay so this goes on for a long time. I'm going to jump to the end. So he's
improvising. Yeah. I hate to -- so there's also a great tenor improviser in the middle who plays for a while. So the strings are in the background.
[music played].
Okay. And I'm going to jump to the end I think if I can find this.
[music played].
Whoa. That was a little too much.
[music played].
[applause].
Okay. So -- yeah?
>>: I'm finding it hard to appreciate how much the system is doing without
knowing kind of the rules of engagement on both sides. Like --
>> Roger B. Dannenberg: Yeah.
>>: Who's doing the improvisation, who is sort of [inaudible] music?
>> Roger B. Dannenberg: Yeah. So that's -- I want to tell you about that. And
I'll tell you in details about what the system is doing.
>>: [inaudible] until the end of this, then almost at the end of the solo with the
strings coming in.
>> Roger B. Dannenberg: Yeah. Yeah.
>>: So is that part of the score or is it -- is it waiting for some specific note for
[inaudible].
>> Roger B. Dannenberg: Well, so that's -- so that's part -- I mean part of the point of showing you some different parts of this video was to illustrate some
of the problems that even though it's mostly steady tempo, there are things like at
the very beginning the strings were sort of following a bit of a rubato performance
by the bass and then at the very end things kind of came to a hold and then the
conductor cued in the string section which plays this very fast thing.
So yeah, so that's in the score. But the score says that there's going to be a little
cadenza by the bass and he's going to hold and then they come in on tempo. So
that's part of it.
>>: There was one part in the very beginning that looked like an awesome thing. I don't know if it was really happening, but it was when the cellist was
silent and the exact moment that the cellist made his first sound the strings came
in, it looked like, you know, within a subsecond of the same time. I didn't know
how that was synchronized.
>> Roger B. Dannenberg: Yeah. Well, so I'll play the beginning again so you
can hear that.
[music played].
Yeah, like right there.
[music played].
Well, so he's actually conducting. So what's happening is the -- so we're using -- in this case we're actually tapping in time, and I'll give you some more details, but there's no -- the sensing is not by audio, the sensing is from a much more kind of discrete foot tapping and manual cueing sort of thing which makes the problem
a lot simpler.
>>: [inaudible].
>> Roger B. Dannenberg: Yeah, yeah. And so -- well, yeah, we're using image
capture but it's through two human eyes that are watching the conductor. And
the bass player is also watching the conductor. And even that -- well, so let's get
back to the -- oh, yeah, good. I have a picture up here.
So this is the implementation for that April concert. There's a foot pedal and
based on foot tapping we're doing tempo estimation and sort of forward
prediction of where the next -- where the next beat is going to be. We also have
a little keyboard which just provides some extra inputs that are used for cues
because what happens -- well, I'll talk about why we do this later. But anyway
the cues and the beat prediction are sort of integrated to find a score position,
and the score position drives variable rate playback of audio. And then what's
happening on the -- for the variable rate playback, we have again 20 -- 20
channels -- 20 strings, each string on an independent audio channel because
that enables you to do very high quality time stretching.
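(As a rough illustration of the front half of that pipeline -- foot taps in, a tempo estimate and score position out, driving a playback-rate ratio -- here is a minimal sketch. The names, the eight-tap history, and the structure are illustrative assumptions, not the concert software.)

```python
# Rough sketch: turn foot-tap times into a tempo estimate, a score position,
# and a playback-rate ratio for the variable rate playback.

class BeatTracker:
    def __init__(self, nominal_bpm):
        self.nominal_period = 60.0 / nominal_bpm  # seconds per beat in the recording
        self.taps = []                            # (time, beat_number) pairs
        self.beat = 0

    def tap(self, t):
        """Record one foot tap at time t (seconds)."""
        self.beat += 1
        self.taps.append((t, self.beat))
        self.taps = self.taps[-8:]                # keep a short history

    def live_period(self):
        """Average seconds per beat over the recent taps."""
        if len(self.taps) < 2:
            return self.nominal_period
        (t0, b0), (t1, b1) = self.taps[0], self.taps[-1]
        return (t1 - t0) / (b1 - b0)

    def playback_rate(self):
        """How much faster (>1) or slower (<1) to play the prerecorded strings."""
        return self.nominal_period / self.live_period()

    def score_position(self, now):
        """Estimated position in beats, extrapolated past the last tap."""
        t_last, b_last = self.taps[-1]
        return b_last + (now - t_last) / self.live_period()
```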
The time stretching works by labeling each period of each instrument. So we
have something like eight megabytes of just pitch labels. And so if you want to
stretch a string out, let's say if you want to stretch by 10 percent and you take
every 10th cycle, every 10th vibration of the string and copy it, copy that 10th
one, so now you have 11 vibrations and if you -- if you do the copying very, you
know, carefully, you can -- you can do that with no glitches and it's essentially
artifact free, works really well.
And then similarly, if you want to play 10 percent in the other direction, I'm not
sure which -- if you're scaling the other direction, you can just leave out every
10th vibration and then you play faster.
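(A toy version of that period-level stretching, assuming the period boundaries have already been labeled as sample indices. Real PSOLA overlap-adds windowed segments at the splice points to hide discontinuities; that detail is left out here.)

```python
import numpy as np

# Toy version of stretching by whole pitch periods. period_starts holds the
# sample index where each labeled period begins; ratio > 1 slows the audio
# down, ratio < 1 speeds it up.

def stretch_by_periods(samples, period_starts, ratio):
    out = []
    produced = 0.0   # output samples written so far
    target = 0.0     # output samples we *should* have written at this ratio
    for i in range(len(period_starts) - 1):
        period = samples[period_starts[i]:period_starts[i + 1]]
        target += ratio * len(period)
        # Emit this period 0, 1, or 2 times (drop, keep, duplicate) so the
        # output length keeps tracking the target length.
        while produced + len(period) <= target + 0.5 * len(period):
            out.append(period)
            produced += len(period)
    return np.concatenate(out) if out else samples[:0]

# Example: a fake 100 Hz tone at 44100 Hz, stretched by 10 percent.
sr = 44100
tone = np.sin(2 * np.pi * 100 * np.arange(sr) / sr)
starts = list(range(0, sr + 1, 441))          # one "period" every 441 samples
longer = stretch_by_periods(tone, starts, 1.10)
print(len(tone), len(longer))                 # roughly 10 percent longer
```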
>>: [inaudible].
>> Roger B. Dannenberg: No. One of the band members was.
>>: Who is also playing another instrument at the time?
>> Roger B. Dannenberg: That was the original plan. And everything was built
to do that. And it turned out that the arrangement did not call for vibraphones
and the band has a vibes player who is also an excellent percussionist, and she
-- so she ended up doing the pedal so she would have something to do.
And --
>>: Is this kind of like the system watching the conductor? Is that what the foot pedal is --
>> Roger B. Dannenberg: Yeah, yeah, and so she's listening to the band, she's
watching the conductor. She's also -- she insisted that she be able to have visual
contact with the high hat on the drummer. So, the high-hat's the little cymbal that
is called a sock cymbal that goes up and down.
And this drummer for this piece would normally play that on the beat, so it was
helpful to see that.
>>: Just getting semantic interpretation that a solo is about to end, isn't that --
>> Roger B. Dannenberg: Yeah, yeah, so some of the cues are coming from the
conductor, some of it's just she can listen and she knows what's happening. But
there were a lot of -- a number of string entrances, maybe 10 different entrances
during the piece, and we -- I intentionally made it so that they had to be cued,
they would not happen automatically. And the main idea there was if the band
ever just really messed up and fell apart and got back together again or if she
messed up then I or someone could kill the strings for that entrance, you know,
reset things and then bring it back up and then it would make the next cue.
You know, and in live performance, you know, people miss their cues, so -- usually it's not the whole string section at once that drops out [laughter] and that
would be bad, but that's -- you know, the audience would probably never know
the difference. So that's what we always say when those things happen.
Okay. So another interesting thing about the playback is there are a lot of
systems that have been implemented to do this kind of time stretching. Usually
it's time stretching by a fixed amount over some interval of time. And if you use
Pro Tools or some other editor, you select something and say stretch it 10 percent, approximately.
So what we're doing is actually doing the time stretching with interactive control
in realtime. And one of the things that happens there is that because the
stretching works on periods of vibration, the amount of -- you can only cause time
to jump ahead or lag by these discrete intervals. It's not really a continuous
process.
And so if you have 20 strings and they're all playing different pitches and you're
trying to -- and you're constantly kind of ramping them up and down in tempo to
follow the band, then all these quantization errors are going to start piling up and
the whole orchestra would start -- potentially would diverge. I mean, it's kind of a
statistical process. So what I had to implement was -- so this PSOLA is pitch
synchronous overlap add. That's the process for reassembling this audio.
And so we monitor -- we have a buffer in front of the time stretching that is being
filled from the 20 channels of audio into these 20 buffers. And by monitoring, you
know, whether the buffer is getting full or getting empty, we can modulate the
amount of time stretching that we're telling the pitch synchronous overlap adds to
do. So it's kind of a servomechanism so that they all track each other and we
don't accumulate any rounding error.
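(A minimal sketch of that servo idea, assuming each channel has a buffer of stretched output waiting to be played: the commanded stretch ratio is nudged by how full that buffer is relative to a target, so the small period-sized jumps never accumulate into drift between the 20 channels. The gain and the target fill level are made-up illustrations.)

```python
# Per-channel "servo" on the stretch ratio commanded to each PSOLA stretcher.

def channel_stretch_ratio(global_ratio, buffer_fill, target_fill, gain=0.1):
    """Return the stretch ratio to command for one string channel.

    buffer_fill and target_fill are in seconds of queued output audio. A low
    buffer means this channel has fallen behind, so stretch a bit more
    (produce more output per input period); an overfull buffer means back off.
    """
    correction = gain * (target_fill - buffer_fill) / target_fill
    return global_ratio * (1.0 + correction)

# Example: the band's tempo calls for a 1.05x stretch, but one channel's
# buffer has drained to 40 ms against a 50 ms target, so it stretches harder.
print(channel_stretch_ratio(1.05, buffer_fill=0.040, target_fill=0.050))
```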
So all that stuff's going on in realtime during the performance. Yeah?
>>: [inaudible] an actual 20 string section performing that piece at reasonable --
>> Roger B. Dannenberg: Yes.
>>: Not generic [inaudible] but somebody performing that --
>> Roger B. Dannenberg: Right, right. It's actually recordings. We did the
recordings by first recording -- it's a little -- we actually did put down a click track,
and then we had a rhythm section play with the click track so that they would be
-- it's actually the same rhythm section so that they would give the kind of
syncopated feel that the piece called for.
And then we multi-tracked -- we had four or five violin, viola and cello players
come in and multi-track the piece. So they're actually -- you know, we had at
most five players, but we ended up with 20 tracks.
>>: [inaudible].
>> Roger B. Dannenberg: Yeah. Each one on a separate -- so we were able
with our studio, we were able to record three at a time in isolation booths. So -- but it's all totally isolated.
Okay. So that's an example. What I want to do is now is talk about a -- the way
I'm thinking about this problem. I see, you know, what we did in April is just a
first -- a first, you know, prototype implementation to learn about the problems.
And now I'm really thinking about other ways to solve, there's better ways to
solve this, how to generalize this. And I think the way to think about this is to
kind of analyze this type of performance, this popular music performance and
think about just break it down into what are the elements, and once we have a
more abstract way of looking at performance, we can think about how do we get
the computer to do this.
And so I have this thing I'm calling the performance model which really has two
parts. There's a listening part and there's the performance, the playing part. And
so on the listening part, here's what you have to do to play -- to play music with
other people.
First of all, you have to know generally where you are in the music. So, you
know, are we in -- are we playing, you know, what piece are we playing, are we
playing letter A or letter B, or is this the verse or the chorus, that type of thing?
You also have to know the beat and the tempo. Because again, we don't
synchronize by synchronizing notes, we synchronize by having some abstract
notion of where we are in the piece, and the primary structure there is really the
beat. And the fact that the beat is steady allows musicians to achieve a lot of
synchronization even though they might all be improvising. So beat and tempo
are really critical.
The second thing is, and this wasn't obvious to me at first, but it's really
important to know where 1 is. So what I mean by that is most music is in 4-4.
Other music is in 3, or 5, or 7. But generally if you turn on the radio you'll hear
something in 4 which means the structure is 1, 2, 3, 4, 1, 2, 3, 4, and so when
musicians say, you know, where is 1, they're talking about the downbeats of
these measures.
This other piece, da, da, da, da, 3, 4, da, da, those are all on 1, and so if you
know where -- the purpose of finding 1 in music, again, something I never even
thought about before, thinking about this problem, the reason that we have
measures and we think about where one is and that's useful, is that these go by
really fast, you know. If we didn't have measures, it would be 1, 1, 1, 1, 1, 1, 1, 1.
And if you're going at that rate and you try to nod to your musician, okay, let's
come in now, you're going to have 1, 1, 1, which 1 were you talking about? It
goes by way too fast.
I'll tell you another story. I just performed Sorcerer's Apprentice with a
community orchestra last weekend. And that piece is actually written -- so that's
the bum, bum, ba, ba, ba, bum, ba, bum, ba, bum. This -- it's in 3. It goes 1, 2,
3, 1, 2, 3, 1, 2, 3, 1, 2, 3. So this is 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. And so for the
trumpet player, you look at the score, and you play a little bit and then you're out
for 50, 60 measures. 1, 1, 1. Counting for dear life.
And I finally had to -- I mean, I just after the first rehearsal I said I just can't do
this, I can't count 1 when 1 goes that fast, and so it turns out the piece really is
structured into groups of three measures. I think the notation is just wrong. But
you know, who am I to tell Paul Dukas how to orchestrate?
But I actually had to notate in my music groups of three and that enabled me
cognitively to deal with the stuff going by. So 1 is really crucial. And that's if
you're going to communicate and synchronize with your players, you do it on
the basis of measures. So that's the point.
And so you have to know where 1 is. And you also have to listen or sense what
other players are telling you. So the other player gives you a nod or the
conductor gives you a downbeat or whatever, you know, somehow those cues
are critical. And the reason cues are critical is that in this kind of music, people
do improvise a lot. They might -- the vocalist might vamp for a couple of extra
bars, somebody might forget to come in. You might have a transition and the
rhythm section forgets how many measures it is and they add some extra
measures. And so people do in live performance kind of look at each other,
gesture to each other, yell at each other. Whatever it takes to stay together.
And so while all this perception stuff is going on, we're also performing music.
So that means you have to do tempo regulation, phase regulation, which means
when I know -- just because I know where the beats are, that's not all there is to
playing. If I'm a bass player, I might play a little bit ahead of the beat. If I'm a
drummer, I might feel like I'm playing exactly on the beat. If I'm playing trumpet
with this jazz band that I play with it's really -- it's crucial that the horns actually
play a little behind the beat. This is a big band and it's just the style of the band
is that we play behind the beat. And it makes it sound hip. It sounds cool. It's
jazz. And sometimes guys will come and sit in with the band and they're more
classically trained guys, and they'll play right on the beat. And they play just -- they're great musicians and they play every note right and they're on the beat,
and we never invite them back [laughter] because it just -- it just doesn't swing.
So phase is really critical. And not very well understood and not very well
studied, but so there's a whole set of problems there.
And then there's dynamics, how loud you play; and expression, which is kind of
everything else.
So let's talk about some of these problems. The first one is know where you are
in the music. So how do we do that? There's one technique for that that I think
shows a lot of promise is work on audio alignment. So I've done a lot of work in
this area, other people have too. This is an example -- I pulled this from some
classical musical audio alignment. This is Beethoven's Fifth Symphony First
Movement, and the horizontal axis is audio rendered from a MIDI file and the
vertical axis I think is the Philadelphia Orchestra, but it's a human orchestra
playing.
And using kind of a forced alignment dynamic programming and some features
that are -- I can go into the details -- called chromagrams, but basically we're extracting
spectral features. We can find the best alignment of these sequences of features
that we extract and actually match up the -- match up two performances of the
piece.
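(One way to reproduce this kind of offline forced alignment with off-the-shelf tools, assuming the librosa library is available: extract chroma features from both recordings and run dynamic time warping over them. The file names are placeholders, and this is a sketch of the general idea, not the exact features or matcher used in the work described here.)

```python
import numpy as np
import librosa

hop = 2048
sr = 22050
midi_audio, _ = librosa.load("symphony_from_midi.wav", sr=sr)
orch_audio, _ = librosa.load("symphony_orchestra_recording.wav", sr=sr)

# Chroma ("chromagram") features summarize spectral energy per pitch class.
chroma_midi = librosa.feature.chroma_stft(y=midi_audio, sr=sr, hop_length=hop)
chroma_orch = librosa.feature.chroma_stft(y=orch_audio, sr=sr, hop_length=hop)

# dtw returns the accumulated cost matrix and the optimal warping path:
# pairs of frame indices saying which frame of one recording lines up with
# which frame of the other.
cost, path = librosa.sequence.dtw(X=chroma_midi, Y=chroma_orch, metric="cosine")

# Convert frame indices on the path to times in seconds; the path comes back
# end-to-start, so reverse it.
times = np.asarray(path)[::-1] * hop / sr
print(times[:5])   # (midi-rendering time, orchestra time) pairs
```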
This is a screen shot from Audacity, where I'm doing some work to integrate these algorithms so that we can feed in a MIDI file. This happens to be a Haydn symphony with a recording, and you click align and the system will automatically
match up the symbolic representation of the music with the audio.
And so for doing live performance, the idea is knowing where you are, you know,
is this a verse or a chorus? I think we can take existing recordings of the band,
you know, maybe made in a rehearsal and quickly mark them up with, okay, this
is section A, this is section B or this is a verse and this is a chorus and do some
kind of realtime version of this matching.
But there's a whole field of research here for following popular music
performances in realtime and giving feedback to musicians about where they are.
So that's one issue I hope to work on.
Knowing the tempo is the next problem. So the first thought you might have is
well, we'll just use audio analysis and find the beats. And people that have
worked in that area know that we've made a lot of advances but it's still really
hard. And if you want to do this reliably enough that you trust your computer to
get up on stage with you and play and always know where the beat is, I find that
pretty frightening. So that's why I've resorted so far to this foot pedal thing.
I think that after doing this concert in April and after looking at the high hat and
seeing how consistent it was, I think there's a lot of room for combining multiple sensors -- maybe even accelerometers on cymbals and drums, and use drum sensors
and use beat tracking algorithms and if we put all of that stuff together with a lot
of sensor fusion sorts of technologies maybe we can build something that's
really, really reliable. And so that would -- that's a new direction to pursue.
>>: [inaudible].
>> Roger B. Dannenberg: Conductor. Yeah. There --
>>: You can see that there's that movement.
>>: Or even if he has some sensors in his hands.
>> Roger B. Dannenberg: Yeah. I've actually captured a lot of conducting
gestures with an accelerometer, wrist worn accelerometers, and other people have
used other kinds of sensors. And so far the conclusion I've -- I'm reaching is that
conductors -- well, that that's -- that might be an interesting source of data, but
it's not enough, and it's not very reliable. In fact, I've also done interviews with
conductors where they conducted and we shot video and then we asked them to
review the video and tell us what were you doing, what were you thinking, and at
least in the classical world conductors will tell us, you know, I'm not trying to
indicate tempo or anything here, and I'm trying to fix this problem with the
dynamics of the piano player. And then the conductor will say, okay, there I got
his attention and we fixed that, and I went back to -- but, yeah, it's very
surprising.
I think the common assumption is the conductor is up there to tell everyone where 1
is and to tell them what the beat is. I think that's not at all what actually happens.
>>: [inaudible] farther back away from the popular music direction you're talking
about --
>> Roger B. Dannenberg: Yes, that's right. And as you know, popular music has no
conductors.
>>: [inaudible] as well as the not conductor performances. Imagine particularly
in a fusion world watching the movements of the violin bows, all of them, or
watching the moves of the hands of the guitar players, all of them, and being able
to say okay I've got a pretty good idea of the system based on 35 sensors or 350
sensors or 1,000 sensors that I know where we are, and you know, I want to
follow the guitar player now.
>> Roger B. Dannenberg: Yeah. Yeah. And that's exactly the kind of approach
that I think that we need. And so I'm very interested in pursuing that. Maybe not
initially with 300 sensors, but, yeah, I think that's exactly the right idea that the
beat kind of moves around among -- I mean, where is the good place to find the
beat is fluid and so we need to not only have algorithms to look for the beat but to
try to figure out which data is reliable and which data is not and so this is the
whole fusion problem.
>>: [inaudible].
>> Roger B. Dannenberg: Yeah. Yeah. Okay. So foot tapping, sensors, a lot
of room for machine learning in here.
And let's go on to the next topic is where is 1. So in the music information
retrieval world finding downbeats has not been a real problem that people have
identified. But I think there are some techniques, for example finding out where
a chord change is happening, looking again at player gestures and then, you
know, what we did, the reliable, simple but kind of distracting way of giving cues
is to just press a key or give some kind of out of band signal, which for a
percussionist is no problem. But you kind of have to have a free hand, and you
have to be thinking about it.
The next thing is communicating cues with other players, and so the problem
here is not how do you know where you are but how do you know you know
where you are? And musicians do this all the time in very subtle ways
sometimes. So for example, when I'm playing with other trumpet players in a
band and we're typically playing different notes but the same rhythms and so just
standing next to someone who's breathing, inhaling, getting ready to play the
note at the same time you are, you have this kind of sixth sense that you must
be doing the right thing because you're breathing at the same time that they are.
And it's something that you don't really think of consciously except when they
have an entrance that's one beat after you and you breathe and they don't
because they're doing the right thing, it's very easy -- you know, you really have
to either trust yourself and go with it or you can get faked out or you just miss it in
rehearsal and next time you make a note and remember that you have -- this is
your solo and you don't have to synchronize with somebody else.
So those kinds of interactions I think are very hard to reproduce with computers,
and so I'm not sure how to do that. But I think it's going to take some
combination of visual displays. We talked about tactile displays like, you know,
put a -- something that would tap you or shock you or buzz you or something.
And people, there have been some instances of people using I guess little
vibrators attached to performers so performers could signal each other, which
I think is -- without making sound, which I think is a really interesting idea. But
okay. So there's kind of a whole research area here of this kind of cuing during
musical performances.
And then we get to performing. Now that we've thought about listening and
sensing, what about playing? And so part of this is tempo and one of the
problems is given prior beat times how do you know the current tempo and how
do you estimate the time of the next beat? So initially I thought, well, we're dealing
with steady tempos, so this is not really a problem. And it turns out that as I've
been measuring tempo of live performances, I find a lot more variation than I
expected. I mean, even a jazz ensemble playing with a drummer keeping time
will fluctuate tempo up and down 10 percent based on whether the soloist is
getting excited. Or who knows what's going on, you know? I don't really know,
but I see the data and see tempo drifting a lot.
And so this graph is an example of just saying let's take N previous beats and do
a linear regression and call that the tempo and use that to estimate where the
next beat is. And let's figure out for different sizes of N how good the prediction
is.
And so this is what this graph shows. So this says that down here -- if we only look at the two previous beats to guess the next one, our prediction
error is kind of high because these are not true beats but foot taps, and there's
jitter and noise in the tapping.
If we smooth over some more beats like we go up to five to ten beats, then we
hit a minimum here where the prediction is pretty good. And then at some point
as N gets large, we're actually smoothing in data from a time when the tempo
was different. And so we're using the wrong tempo to estimate the next beat.
And as this number gets large, the error gets -- well, this is up to 45 milliseconds of typical error -- well, actually I think that's the standard deviation. But anyway, we're
getting large errors if the window gets too large.
And of course, you know, window size depends on the characteristics of the
players and, you know, how steady things really are.
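(The regression described here is simple enough to sketch directly; the window size n is the quantity being varied along the horizontal axis of the graph, and the tap times below are made up.)

```python
import numpy as np

# Minimal sketch of the tempo estimate just described: fit a line through the
# last n tap times and extrapolate one beat ahead.

def predict_next_beat(tap_times, n):
    """tap_times: times (seconds) of successive taps; predict the next one."""
    recent = np.asarray(tap_times[-n:])
    beats = np.arange(len(recent))
    slope, intercept = np.polyfit(beats, recent, 1)  # slope = seconds per beat
    return slope * len(recent) + intercept           # one beat past the window

# Slightly jittery taps at roughly 120 beats per minute (0.5 s per beat).
taps = [0.00, 0.51, 0.99, 1.52, 2.01, 2.49, 3.02]
print(predict_next_beat(taps, n=6))   # roughly 3.5
```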
>>: Is this [inaudible] different genre, is it something specific to --
>> Roger B. Dannenberg: Well, this general U shape I think is common, but the
-- you know, it would -- the minimum will move to the right as the music becomes
more and more steady. And so it really depends on the players. Probably
depends on the genre, too.
>>: Seems to be two to three down beats, right? I mean, if you get to the two
beat thing you're basically not even capturing you know two measures. So it
seems like that would be outside --
>> Roger B. Dannenberg: Yeah, so most -- well, actually, you know, so this
curve looks like the minimum is at six. There's some other data we looked at
where that number got a little bit larger. But, yeah, I think in the five to ten range
is typical. Yes?
>>: Does your synthesis system have a delay in it, you have to sort of play -- start to -- initiate the player early?
>> Roger B. Dannenberg: Yeah, you do. And so that's -- that's why we can't
really wait for a foot tap or anything, we want to predict it.
And actually, it also turns out that if you look at past data again with popular
steady tempo music, if you look at past data, you can predict the time of the next
beat more accurately than someone can actually tap on that beat. So when
they're tapping, they're getting typically around 30 millisecond standard deviation
around the true beat, which is unknown, but we have some interesting ways I could tell you about later for why we know that it's 30 milliseconds when we don't really know where the beat is.
But we do know that. And so using this data, we can get -- we can shrink that down to, you know, more like 20 or 25 milliseconds. Yeah?
>>: [inaudible] there were a couple of string entrances from rubato sections.
Did your vibraphone/foot pedal player learn to anticipate by audio lag, or did we
just not notice? Do you know what I mean? If the vibraphone player signal -- to
tap the pedal or press the key maybe is the case there right on cue, we'd expect
to see here some audio lag unless that person had kind of gotten good at it and
--
>> Roger B. Dannenberg: Oh, yeah.
>>: And 25 milliseconds or whatever it is that ->> Roger B. Dannenberg: So what we did in this piece was the strings always -the strings never come in immediately when the cue is given. So the cue is
always given ahead of times. Sometimes that had to be done on ->>: [inaudible] tempo beats in between the cue [inaudible] the string entrances?
>> Roger B. Dannenberg: Right.
>>: Okay.
>> Roger B. Dannenberg: Right. And sometimes things were basically hard
wired so that there wasn't actually a separate cue, it was just tapping, which was
the scariest part, but it really wasn't such a problem because the tapping
interface is so reliable.
But, for example, the conductor would go 1, 2, 1, 2, 3, 4, and then -- and so
when conductor did 1, 2, that enabled the tapper to get the tempo in her head.
And then when he goes 1, 2, 3, 4, then she starts tapping. And I think she would
only tap actually I think two beats there and then the strings would come in.
>>: [inaudible] that you placed on the [inaudible] they shouldn't just be wild
entrances that didn't have time for say count before them?
>> Roger B. Dannenberg: Yeah. He actually -- you know, we talked about all
this, and he went off and did what he wanted, so [laughter] I ended up -- I mean
the only concession we really had to make is that between the slow section and
the fast section he wanted to just go something like, you know, 2, 3, 4, boom and
have the band come right in and I said no, you can't do that because the strings
have to be there. So you got to give me two measures of count between the
slow section and the fast section. So -- but otherwise all this kind of
pseudo-rubato stuff, we agreed that we weren't going to do that.
And so, you know, we ended up -- it actually wasn't that much rubato, it was just the way it was written it kind of sounds rubato.
>>: [inaudible].
>> Roger B. Dannenberg: Yeah?
>>: On the previous slide, so when you were kind of adjusting, do you have some
mechanism where you were kind of like catching up if you were making a
mistake or -- I mean like calibrating the window size live based on --
>> Roger B. Dannenberg: No, and that's something we've tried looking at, so there's -- for example, there's a formalism called a switching Kalman filter that
tries to look at whether your model is fitting the data, and when it's not you could
switch to another model.
We haven't had luck with the switching models yet, but I think that's possible.
And I think that human players are doing something like that, that when -- when
the assumption is steady beat and somebody needs to change the tempo, it's
very common in a performance with players that somebody does something, they
go like let's speed it up or the -- you know, the band leader will say to the
drummer like, you know, to something. And so that's kind of an out of band
signal that's telling everybody, okay, reset your tempo estimation. And that
seems to be necessary for humans. I'm not quite sure how to do that with
machines, but that's a problem.
Okay. Let me just -- I think I talked about this a little bit. Oh, yeah and I want to
get to -- okay. So the point here is there are lots of interesting problems. And we
could go through lots more of them. I want to -- I want to just mention some
other work with this example. This is -- I talked about this problem of
professional, even virtuoso recordings are things that we have to -- that amateurs
have to contend with. Everyone hears them now.
And so one thing I think we can do for amateurs, especially, but very useful for
professionals is use computers to fix up their recordings according to scores or
other information to make them sound better.
So people do this by hand all the time, but it's extremely tedious. And so I've
started looking at maybe we could automate a lot of this process.
So this is a trumpet trio that I performed and I intentionally played this without a
click track. There were some long gaps where you had to just count and then
come in and hope you were right. And I knew it would fail. And so this generated kind of what in the business we call a train wreck. So I'll play that,
and then I'll play the editor output.
[music played].
Okay. So -- and this is a picture of some of the wave forms. Like in the score,
this note and this note and this note should actually all be aligned with one
another and they weren't. Okay. So I took the machine readable version of the
score, which, you know, could come from just a MIDI file, and the audio all as
separate tracks, and I fed them into this little prototype editor that tries to look for
timing discrepancies and fix them up. It also looks at intonation. And if
something is generally sharp or flat, it fixes that. And it also looks at dynamics
and tries to make everything balanced.
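(A sketch of just the timing-correction step of such an editor: compare the score's intended onset time for each note with the onset detected in one performer's track and compute a per-note shift. Onset detection, intonation correction, and level balancing are omitted, and the numbers are made up.)

```python
import numpy as np

def timing_corrections(score_onsets, detected_onsets):
    """Return per-note shifts (seconds) that move detected onsets onto the
    score grid, measured relative to the first note of the phrase."""
    score = np.asarray(score_onsets, dtype=float)
    played = np.asarray(detected_onsets, dtype=float)
    # Anchor on the first note so we fix relative drift, not the absolute start.
    drift = (played - played[0]) - (score - score[0])
    return -drift

score_onsets    = [0.0, 2.0, 4.0, 6.0]   # onsets from the MIDI version of the score
detected_onsets = [0.0, 2.1, 3.8, 6.4]   # onsets found in one trumpet track
print(timing_corrections(score_onsets, detected_onsets))
# -> [ 0.  -0.1  0.2 -0.4]  shift each note by this much before mixing
```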
And so here's the output. There's still definitely some problems, but -- and if I
thought of -- if I realized what was going to happen in advance, I could have sort
of made a different mode in this editor to handle this too. But I wanted to be able
to claim that I ran this stuff through the editor, and this is what you get with no
correction and so here we go.
[music played].
So better intonation, better synchronization, better balance. I mean, it's really
remarkable to think you could just take that initial recording in a totally automated
process come up with this. So my vision for amateurs and even professionals in
the future is something like spelling correction where you put all the -- you put the
audio up on the screen and a little window pops up and says you came in early here, do you want me to fix it? And you say yeah. And you just go through the
score in minutes and make something that sounds like it was recorded by, you
know, top studio players or something.
Okay. So let me just wrap up and say that performers do two things. They
gather information about location and tempo, where is 1, they get cues from
other players. And performers play music. And so there are problems of
intonation and dynamics and synchronization. And each one of these is an
interesting problem from signal processing perspective, from the machine
learning perspective and computer science and from music. And so -- and if we
can do all of these things and put it together, then we'll have a system or kind of
a whole new product category that just millions of people will find interesting uses
for.
And so that's -- I've shown you a little bit of work that we've done to get started,
lots of open questions for the future, and that's what I wanted to share with you.
So thanks for your questions and attention.
[applause].
>> Roger B. Dannenberg: Yes?
>>: When you introduced the bottom of -- moving into the popular music domain, you hinted that the problems are score following for pop music where there isn't really a score, so score following -- like following the guitar. The sequence of guitar chords that are loosely interpreted. But in the work you actually showed, mostly -- there was a score. There was a fairly rigid score where there might be
improvisation injected but there was a fairly reduced score. Have you actually
moved down that direction toward following the loosely scored pop structure? I
mean following guitar chord changes and in certain pop settings?
>> Roger B. Dannenberg: Yeah. So one thing I can tell you is this score
alignment stuff that we've done, we've tried this on rock tunes like Beatles tunes,
for example. And what we found is that what we would align MIDI files with
actual you know Beatles recordings, audio. And very often the MIDI files were
not really transcriptions of what they actually did the. They would sort of be -- be
almost like a cover tune but rendered in MIDI. And so in spite of that, we found
that we could do alignment. So that would be cases where the vocal in the tune
would be one melodic line and somebody sort of did an arrangement for MIDI
and they harmonized it. And so they were actually injecting major melody notes
on non-vocal instruments. There would be more phrasing and improvisation in
the audio vocal line than there would be in the notated score. And in spite of all
that, we can actually do alignment.
Although when we do that, we're doing forced alignment, not sort of incremental
score following. And so it still remains a question of, you know, what's the
latency going to be when you do the alignment? Yeah?
>>: Yeah. Have you thought about ways you can have the computer give
feedback back to the musicians? Because it seems like you're just looking one
way with what you're doing. And I know there's some bands that will have like a
drummer play with a click track live to get that end figuration and it seems like if
you could combine both directions you might get some interesting results.
>> Roger B. Dannenberg: Yeah. Yeah. Well, that's what -- this was sort of a
little prototype just imagining what an interface might look like. And so part of
that is sort of the 1, 2, 3, 4, would blink on a display so the computer could feed
back where it thinks it is, and the computer could be updating measures.
The idea is that as a musician -- I mean, we know from some experiments that
we've done and some evaluation of some interfaces that if you ask a musician to
read a score and you put a display right next to it or something updating a lot of
stuff that pretty much the musicians can't think about both at the same time. And
even if they know the music and they're just looking at the display, they'll look at
maybe the 1, 2, 3, 4 and they'll never, ever see the measure number. And so it's
just very hard in realtime to think about music and keep track of multiple things.
And so that's why I think it's such an interesting research problem. You do the
obvious stuff and --
>>: [inaudible] those random cues that happen two or three times on a song.
>> Roger B. Dannenberg: Yeah. But I think what's actually going to happen is
that you need some kind of display that may be that doesn't work the first time
but which is something that musicians can learn to use, and if you present the
right information and they know the information's there and they know how to
access it, then when they're playing and they want that information, they will be able to just glance at it, the same way that if I'm in an orchestra and I've counted
30 measures but I'm not quite sure if I'm right or not I can turn to the guy next to
me and go like this, and that means are we at measure 32? And he'll either say
yes or -- I mean, he'll either do this back to me in some subtle way or he'll say -- or he'll realize that I'm wrong and then we start talking to each other, you know, because then it's clear that something is about to break down. And so anyway -- yeah?
>>: [inaudible].
>> Roger B. Dannenberg: Yes?
>>: [inaudible] question but the [inaudible] that is this famous MRI center for
imaging of the brain.
>> Roger B. Dannenberg: Yeah.
>>: Have you done experiments where you have a [inaudible] or playing music
and seeing the brain images of the [inaudible].
>> Roger B. Dannenberg: I have not. I know that there's been a lot of research
on music performance and MRIs. It's just not an area where I've done any work.
>>: [inaudible] because the MRI center in Pittsburgh is [inaudible] yeah. Yeah.
And some of -- I know that they do a lot of realtime processing that I don't know if
that's been picked up everywhere else yet but I thought about -- you know, it's
one of those things where I think this is incredible. There ought to be some
interesting, you know, question I could ask and get a great answer to. It's almost
like sometimes you see something so incredible you want to find a problem to solve with it.
But I haven't -- you know, I just haven't done that. Yes?
>>: [inaudible] if you thought at all about the maybe not quite so formal setting
but still a group setting so like say you're jamming but you're almost like
performance -- somewhere between like you know the full open jazz thing and
where you're still kind of coming up with pieces. The notion is you still have parts
that you might want to trigger and maybe lose but you're coming up with them on
the fly as you're doing them. So it's like you're random composing, it's kind of
[inaudible] often get together kind of like that's a good group, I want to build on it
and stuff like that. I mean, have you thought at all about like supporting that
scenario?
>> Roger B. Dannenberg: I think that's -- so the question is about less
structured performance situations and how, you know, could computers support
that and I don't really have any good ideas there. Although I've personally
composed a number of pieces, not in this kind of popular music or steady
tempo genre but pieces for interactive computer and performer, so I see a
tremendous potential there. I'm just not quite sure, you know, how to tap into it.
I think that it's going to be a really interesting opportunity if we solve some of these
more standard problems then when you have information about where is 1,
what's the tempo, we have good communication and cuing mechanisms then I
think we might be able to drop in some computer music generation modules and
ways of communicating with them. Yeah?
>>: I wonder what you think about video games like Guitar Hero, Rock Band, if
they're good for getting people interested in performance and learning about
music?
>> Roger B. Dannenberg: Yeah. I think Guitar Hero and Rock Band are incredible. They're -- you know, in 2000 -- I'm not sure how they're doing now, but in 2007, they outsold all music downloads -- actually they were close. They were about a billion dollar business each. But that's a serious music distribution
channel. So one of the most interesting things about that is that there's a whole
generation of young people interested in music now that are not only getting
some performance experience but they're learning about, you know, let's say the
Beatles or in early classic rock that a lot of us probably thought was just dead
and now there's -- you know, it's very popular because that's kind of the -- what
the genre or the collection of songs that ship with some of these products.
So yeah, and I think that the feeling of interacting with other players and with
mastering an instrument even though -- I mean I think mastering -- pushing on
four buttons and watching that visual score is -- has a lot of differences, but I
think there's a lot in common. And I think that's why it's so popular is that -- I
think it's natural to want to play music and to be involved in the production of
music and that really taps into that. So yeah, I think it's a great thing. I think we
have -- at some point in the future, I think people are either A, going to figure out
how to take advantage of these now millions of young people that have this
bizarre music notation reading ability [laughter] they read, you know, four notes
scrolling up the screen but the virtuosity of good players is just -- is amazing. It's
-- so it's, you know, clearly they've internalized a way of looking in advance and
planning muscle movement and all this on this kind of realtime display. And that
alone to me is fascinating.
And so I could imagine future concerts where somehow the audience is really
involved in producing music and the performers will display scores using the
notation of the masses which is like Rock Band or dance -- or Dance Dance
Revolution. Yeah?
>>: Are you comfortable in improving the performance of amateurs so it's like
[inaudible] all those things. Is it possible to [inaudible] amateur sound like a pro?
I mean, you have some samples of [inaudible] and then [inaudible]. I mean, just
replacing as if [inaudible].
>> Roger B. Dannenberg: Yeah. So how can we make amateurs sound like
pros or have that sound? I think that's definitely in the realm of possibility and
something to look for. Personally I think that in a lot of cases amateurs really
have the ability to make the sounds they want. And they want their sounds to be
in their recordings. And what's lacking is the consistency that, you know, I was
talking to the conductor of the orchestra at Carnegie Mellon. They do some very
adventuresome, very difficult pieces and perform regularly in Carnegie Hall in
New York.
And he told me that the real difference between a college -- a good college
orchestra and a good professional orchestra is that a professional orchestra will
rehearse an entire concert in a few days and prepare it and perform it and with a
college orchestra you would spend maybe a few months to do a really advanced
performance. He says so it takes longer, you have to work, and you have to
learn the parts and -- but the end result is the same. And sometimes -- sometimes the college orchestras play better than professionals, especially on
very, very difficult pieces that you can't just really sight read and throw it together
in a few hours rehearsal, you have to really break it down and work things out.
And so -- and that's just another side of this is that it's the consistency and
editing, taking all the good takes and putting them together might be more
valuable than replacing with professional recorded samples.
But I do think that's possible, and I do think that's a really interesting thing to
pursue. Yeah?
>>: [inaudible] given only a few days to rehearse a piece?
>> Roger B. Dannenberg: Typically, yeah.
>>: How often do they change their music?
>> Roger B. Dannenberg: Well so in Pittsburgh, the Pittsburgh Orchestra plays
every -- I mean during the season they play a new concert every week, and
every once in a while they have a week off. But they really -- they come in on
Monday, I think they get Monday off, and they come in Tuesday. So they
basically have Tuesday, Wednesday, and Thursday to prepare, and then there's
concerts Thursday, Friday, Saturday and Sunday matinee.
So, yeah, and I've been to rehearsals and seen when the Pittsburgh Symphony
plays a classical symphony like Mozart, it's common for the conductor to just play spots and say okay, let's play the beginning. There might be some tricky section
where you say okay, everybody skip to this and we'll play this. But they won't
even play the piece through once before the Thursday night opening.
But these musicians are unbelievable musicians. You put anything in front of
them and they play what's -- you know, they play what's there. And of course
they're practicing and preparing the stuff at home. So when they walk in on the
first day, they've had the music for months and it's their -- you know, they get -- they get paid big bucks to show up and know their stuff.
So if anyone actually missed a note in -- okay. I'll tell you another story just
came to mind. Carnegie Mellon student horn player subbed for the symphony
and I heard that in the performance one of the other horn players missed a note
and the conductor, I'm not sure who it was, the conductor stopped the orchestra and
looked at the student and said you're not in college and then he went on. And it
wasn't even her that missed the note. Nobody said a word. But, you know, that's
the level of musicianship that's expected that nobody's going to miss a note, so
all they have to really do is work on balance and articulation and expression and
the orchestra has to get a sense of how the conductor's going to conduct.
So it's a different -- I mean the professional world is a really different world. And
it's even in Hollywood with film score reading, it's very much the same thing, only
those guys usually don't have the parts in advance. They come in and they sight
read, and they're expected to play everything down without missing a note. And
I've talked to some of those players, even the major motion picture theme from Rocky,
that thing with all the trumpet players, one of my teachers played fourth trumpet
on that, and he said the thing in the movie was the fourth take, and nobody had
seen the music before, you know, that day when they recorded it. So -- but
they're phenomenal musicians and they make a lot of money because they can
come in and just -- and be perfect all the time. Yeah?
>>: [inaudible] you're doing like [inaudible] as well as special dimension. So are
you [inaudible] that knowledge instrument [inaudible] because I [inaudible] certain
sections that's in orchestra music where the violins are accompanying all the
wind instruments and they [inaudible]. In your example of your video you had a
string [inaudible] almost like one single ->> Roger B. Dannenberg: Yeah.
>>: But quite often the strings are broken and the [inaudible]. So are you also
kind of like accompanying an instrument in [inaudible] to know where you are or
to know this is the right [inaudible].
>> Roger B. Dannenberg: We are not doing that. We're not coupling instrument
recognition to anything else. Not because it's not a good idea, it's just that no
one knows how to do it. And so -- I mean one of the -- one of the problems -- I
guess it's almost an unstated assumption in all of this music understanding work is that we don't really know how to do source separation, and so everything we hear is kind of a composite of all the sounds at once and we do what we can with
that.
Or we go out and we put separate microphones on each instrument and try to do
source separation by [inaudible] I mean we just try to keep it separate rather than
merging it together. But you know, I think that's -- there are some possibilities
there, but so far people are getting better results and achieving more practical
things by looking at pitch and trying to really strip out all the timbral differences.
So whether it's the violin or a trumpet playing, the important thing is what pitches
are they playing?
>> Sumit Basu: Time for one more question.
>> Roger B. Dannenberg: Yeah?
>>: Perhaps a silly question. The strings which play, which are being played by
the computer, those are prerecorded sequences, it's not realtime?
>> Roger B. Dannenberg: That's right. So there's no improvisation there. All of
their parts were notated and written out. And so essentially what we're doing is
trying to cue them in and keep them synchronized.
>>: Thank you.
>> Roger B. Dannenberg: Thank you.
>> Sumit Basu: So let's thank Roger again.
[applause]