Mach Learn
DOI 10.1007/s10994-006-8751-3
Modeling, analyzing, and synthesizing expressive piano
performance with graphical models
Graham Grindlay · David Helmbold
Received: 23 September 2005 / Revised: 17 February 2006 / Accepted: 27 March 2006 / Published online:
21 June 2006
© Springer Science + Business Media, LLC 2006
Abstract Trained musicians intuitively produce expressive variations that add to their audience’s enjoyment. However, there is little quantitative information about the kinds of strategies used in different musical contexts. Since the literal synthesis of notes from a score is bland
and unappealing, there is an opportunity for learning systems that can automatically produce
compelling expressive variations. The ESP (Expressive Synthetic Performance) system generates expressive renditions using hierarchical hidden Markov models trained on the stylistic
variations employed by human performers. Furthermore, the generative models learned by
the ESP system provide insight into a number of musicological issues related to expressive
performance.
Keywords Graphical models · Hierarchical hidden Markov models · Music performance ·
Music information retrieval
1. Introduction
Skilled human performers interpret the pieces they play by introducing expressive variations
to the performed notes, and these variations add to the enjoyment of most¹ human listeners.
Although there are many ways in which a performer can introduce variation, perhaps two
of the most important are changes in tempo and dynamics. Tempo change or rubato is a
non-linear warping of the regular grid of beats that defines time in a score. Variations in
Editor: Gerhard Widmer
G. Grindlay (✉)
Media Laboratory, Massachusetts Institute of Technology
e-mail: grindlay@media.mit.edu
D. Helmbold
Computer Science Department, University of California, Santa Cruz
¹ Of course each individual will have their own preferences, but it is reasonable to assume that the top concert pianists are popular because their performances include something more than a precise rendition of the composer's score.
the loudness or dynamics (referred to as velocity in the case of piano) of individual notes is
another commonly employed performance strategy. Other ways to add expression to a piano
piece include articulation (the gap or overlap between adjacent notes) and the use of the
pedals. Our work focuses on tempo variations and considers dynamics to a lesser extent.
We present a general framework for capturing the statistical relationship between compositional score structure and performance variation. Our system uses a dynamic Bayesian
network (DBN) representation of a hierarchical hidden Markov model (HHMM) to learn
these relationships automatically from a corpus of training data. Once trained, the model can
be used to generate expressive performances (tempo curves) of new pieces not in the training
set. Furthermore, it provides a statistical generative model of how performers deviate from
the written score that can be used for a variety of purposes.
Although generative models like HHMMs are standard machine learning techniques, they
have not, to our knowledge, been used to model expressive performances. Our goal is to use
as little musicological insight as possible, allowing the latent states in the model to learn the
important musical contexts and their appropriate performance variations. One of the main
contributions of this work is to illustrate the flexibility and power of this approach when
applied to a variety of musicological questions. In particular, we use the learned models to
help answer the following musicological questions.
– How consistent are the performances of human individuals?
– How do the performances of professional and trained non-professional pianists compare?
– Can an automated system recognize an individual performer's stylistic variations, and can
it quantitatively identify those performers whose styles sound qualitatively distinctive to
human listeners?
– Is there some unifying structure to the variations in performances by top professionals?
What about the performances of amateurs?
These questions are important as their answers will increase our understanding of the processes and strategies used by talented musicians to produce memorable performances. Additionally, answers to these questions may enable a number of music information retrieval
applications, such as performer search and classification.
Most people would judge the literal rendition of a musical score to be significantly less
interesting than a performance of that piece by even a moderately skilled musician. This
makes methods for “bringing life” to musical scores useful for several reasons. Such a
system could be a boon to composers, who may find the ability to hear their music as it is
likely to be performed extremely helpful during the composition process. One might also
imagine a system capable of providing tireless accompaniment to a soloist or even a teaching
system that could evaluate and critique student performances. Additionally, we believe that
by examining the structure of a trained model, it may be possible to gain musicological insight
into the complex relationships between compositional structure and expressive performance.
Our current work focuses on modeling tempo variation in piano melodies. In addition to
being a very popular instrument, the piano is attractive because of the inherent constraints
on note variation. A piano performer can only vary the timing and loudness of notes. In
contrast, most other popular instruments have additional hard-to-measure ways of varying
the performance of each note (such as vibrato, glissando, and the ability to vary the dynamics
throughout the note). Our techniques can be easily generalized to other instruments given a
suitable encoding of the performance variations. The present emphasis on tempo variation is
entirely due to limitations imposed by the majority of our data. Results by other researchers
indicate that tempo is harder to learn than dynamics (Tobudic & Widmer, 2003). We also
present positive results for combined tempo and dynamics synthesis using a smaller (but
richer) performance data set (see Section 8).
Statistical musicology is still in its infancy and it is extremely difficult to collect large
amounts of quantitative data. We are deeply indebted to Bruno Repp for making two data
sets available to us. One data set contains tempo information (only) from 28 performances
of Robert Schumann’s romantic piano piece Träumerei, Op. 15 No. 7 by 24 world-class
professional pianists (there are three performances each by Cortot and Horowitz). This
information was extracted by hand from commercially distributed CDs. The other data
set consists of performances of Träumerei, Chopin's Prelude No. 15 in D♭, and Chopin's
Etude Op. 10 No. 3 by 10 graduate students in the Yale School of Music, recorded in MIDI
format.
We employ similar learning strategies for the two data sets using features from both the
score and the performances to build the models. Each model state represents a “musical
context” and the transitions between states represent the changing context throughout a
piece. In our two-level hierarchical models the top-level abstract states represent context on
the level of a musical phrase.² The production states are children of a top-level abstract state
and represent the various note-level contexts found within their parent’s phrasal context. The
training data contains a sequence of notes from the score and how they are performed. Each
note i in the score has its own vector of score features, denoted f_i^s, and an associated vector
of performance features, f_i^p, describing how the note was actually played. Each production
state contains a joint distribution P(f^s, f^p). We represent performances as the sequence of
performance feature vectors, f^p, for the played notes. Similarly, a score can be represented as
a sequence of score features, f^s. Given a set of performances and their corresponding scores,
the training process attempts to find the (hierarchical) HMM parameters, θ, that maximize
the likelihood of the given feature sequences. Once the model is trained, a performance
of a new score, f^s, can be synthesized by finding the sequence of performance features
(approximately) maximizing P(f^p | f^s, θ). Given an additional performance, f^p, of a
(possibly new) score, f^s, P(f^p | f^s, θ) gives a relative indication of how closely the new
performance fits the trained model.
Because our goals include understanding how skilled performers deviate from the written
score, we are interested in models that are interpretable. To this end, we train using an
entropic bias along with pruning techniques (Brand, 1999b) to encourage sparseness in the
learned models (i.e., relatively low state interconnectivity). This results in models that are
often superior, in terms of both data fit and generalization, to those produced by classical
training methods (EM, gradient descent, etc.). Additionally, because of their sparsity, these
models are likely to be more interpretable than their classical counterparts.
We conducted a variety of experiments to test the usefulness of the ESP system as a
general modeling framework for the synthesis and analysis of expressive performance (see
Section 7). Despite a difficult learning task, our results were generally quite good. The results
of an experiment in which we synthesized tempo curves using trained models show that ESP
is capable of generating performances of previously unseen scores that follow many of the
general trends of human performance. We also conducted several experiments designed to test
the system’s use in performer comparison tasks. A comparison of the performance models
of 24 famous performers and 10 student performers seems to confirm earlier observations
(Repp, 1997) about the differences in stylistic variation between professional and student
² A musical phrase is a relatively self-contained section of music, somewhat analogous to a sentence in language. We consider the low-level phrasing information often available in commercial scores. For the pieces in our dataset, each phrase typically contains between 3 and 15 notes.
performers. Additionally, we investigated the ability of the ESP system to recognize individual
performers. We conducted a series of three-fold cross validated experiments in which a model
was trained on two performances by a specific performer and then used to recognize a third
performance by that performer. When trained on two performances by Alfred Cortot, the
system was successfully able to pick out a third Cortot performance from amongst a set of
36 performances by other performers. A similar experiment conducted using performances
by Vladimir Horowitz was somewhat less successful, as the test performance was recognized
in one of three trials. This experiment was also carried out for two student performers with
good results.
The remainder of the paper is organized as follows. We begin in Section 2 by reviewing
some of the relevant machine learning and computational musicology work. Then in Section 3
we discuss the data sets and features used in our experiments. In Section 4 we review the
hierarchical hidden Markov model and present the model used in the ESP system. Section 5
describes the training algorithms in the ESP system and Section 6 describes how ESP synthesizes performances from novel scores. Section 7 discusses tempo-based experiments using
the ESP system. In Section 8 we describe an earlier version of the ESP system that uses a
different modeling scheme handling both tempo and dynamics. Finally, we summarize and
discuss our findings in Section 9.
2. Related work
Artificial intelligence and machine learning techniques have found a variety of applications
to musical problems in recent years. Much of this work revolves around the so-called “transcription problem” (trying to extract the symbolic representation of a piece of music from
raw audio) (Raphael, 2002; Cemgil, Kappen and Barber, 2003; Scheirer, 1995; Dixon, 2001)
and the related problems of tempo tracking and rhythm quantization (Lang & de Freitas,
2005; Cemgil and Kappen, 2003). The problems of modeling aspects of composition and
performance style have received somewhat less attention. Since two recent overviews (Widmer & Goebl, 2004; de Mántaras & Arcos, 2002) contain extensive references, we mention
only the most recent and closely related work here.
Wang et al. use HMMs to learn and synthesize a conductor’s movements (Wang et al.,
2003). Their work uses a single HMM where each state models a joint distribution over the
music (energy, beat, and pitch features) and the positions and velocities of markers on the
joints of a conductor. Wang et al. build on earlier work by Brand using HMMs to synthesize
lip movements matching speech (Brand, 1999a) and by Brand and Hertzmann to model
different styles of bipedal movement (Brand & Hertzmann, 2000). We adapt the ideas in the
above papers to the music performance domain and extend the models to hierarchical HMMs
using Murphy and Paskin’s dynamic Bayesian network based algorithms (Murphy & Paskin,
2002).
Whereas the above approaches use one HMM with dual outputs, Casey creates a fusion of
two different drum styles using a pair of HMMs (Casey, 2003). Each HMM’s topology captures the high-level structure of a particular drumming style while the emission distributions
capture the low-level audio produced by the style. Casey finds a correspondence between the
states in the two HMMs and uses this correspondence to combine the low level audio of one
style with the high level structure of the other.
Raphael uses Bayesian networks in his Music Plus One system for providing performer-specific real-time accompaniment (Raphael, 2002b). The structure of the Bayesian network
is constructed from the score and then trained with a series of “rehearsals”. After training,
the system provides expressive real-time accompaniment specialized for the piece on which
it was trained.
Work very close in spirit to our own, but using different techniques, is that of Widmer and Tobudic. They model and synthesize expressive performances of piano melodies
using a system that automatically learns rules from musical data (Widmer, 2003). This
system extracts sets of simple rules, each of which partially explains an aspect of the
relationship between score and performance. These simple rules are then combined into
larger rules via clustering, generalization, and heuristic techniques. Widmer and Tobudic
describe a multi-scale representation of piano tempo and dynamics curves which they use
to learn expressive performance strategies (Widmer & Tobudic, 2003). This approach abstracts score deviation temporally and learns how these deviations are applied at both the
note and phrase levels via rule-induction algorithms. In more recent work they use nearest
neighbor techniques to predict how phrases are played (Tobudic & Widmer, 2003; Tobudic
& Widmer, 2005). Our work differs from that of Tobudic and Widmer in that we employ
generative statistical models (hierarchical HMMs) to learn the performer’s deviations from
the score rather than rule-based or instance-based techniques. The hierarchical models capture the phrasal context within a piece as well as the individual note contexts within a
phrase.
Arcos and López de Mántaras use non-parametric case-based reasoning in their SaxEx
system (Arcos & de Mántaras, 2001) which renders expressive saxophone melodies from
examples. Their system utilizes a database of transformations from scores to expressive
performances with features such as dynamics, rubato, vibrato, and articulation. SaxEx synthesizes new performances by looking in its database for similar score features and then
applying the corresponding expressive transformations. Conflicts that arise when matching
score features to entries in the database are resolved with heuristic techniques.
Bresin, Friberg, and Sundberg use a rule-based approach in Director Musices, a system
for creating expressive performances from raw scores (Bresin, Friberg and Sundberg, 2002).
This system is based on a set of hand-crafted rules for synthesizing piano performances.
These rules are parametric, providing a means for tuning the contribution of each rule in the
synthesis of an expressive performance.
Although much of the research that applies machine learning to expressive performance
has focused almost exclusively on synthesis, Stamatatos and Widmer describe a system for
classifying performances (Stamatatos & Widmer, 2005). This system maintains a database
of professional performers and associates each with certain types of performance features.
Then, given a performance, the system uses an ensemble of simple classification algorithms
to identify the most likely performer.
Saunders et al. (2004) also describe a system for performer recognition. They use string
kernels to represent performances as sequences of prototypical tempo-loudness “shapes”.
These shapes represent changes in tempo and loudness over a short time period (roughly
two beats). Once the performances are encoded, they use machine learning techniques such
as kernel partial least squares and support vector machines to recognize the performances of
famous pianists.
3. Data
Statistical musicology is in its infancy, and it is still extremely difficult to collect large
amounts of quantitative data. We are deeply indebted to Bruno Repp for making two data
sets available to us. The first set consists of performances of Schumann’s Träumerei, Chopin’s
Table 1 Data set summary. This table shows the number of performers, the number of usable
performances, the length of the performances in notes, and whether or not dynamics (volume)
information is available for each piece used in our experiments

  Piece                      Performers   Performances   Dynamics   Length (notes)
  Träumerei (professional)   24           28             no         117
  Träumerei (student)        10           10             yes        117
  Prelude (student)          7            16             yes        131
  Etude (student)            7            7              yes        27
Prelude No. 15 in D♭, and the first six bars of Chopin's Etude Op. 10 No. 3 by 10 graduate
students in the Yale School of Music. The performances are in MIDI format (containing
both tempo and velocity information) and were recorded on a Yamaha Disklavier³ piano
(see Repp (1992) for more details). For Träumerei and Etude Op. 10 No. 3 there is one
performance per performer. For Chopin's Prelude No. 15, each performer recorded three takes,
with the exception of performer EK, who recorded only two. The
performances of all three pieces were matched to their corresponding scores using a simple
dynamic programming-based string matching algorithm. We discarded several performances
because of their poor match to the scores, and call the remaining performances the student
data set.
Our second data set, which we refer to as the professional data set, contains timing information (only) for 28 performances of Träumerei by 24 famous pianists (there are three
performances by both Cortot and Horowitz). This information was extracted by hand from
commercially distributed CDs, and contains the duration of inter-onset intervals in milliseconds. The inter-onset interval for a note is the time between the initiation of the note and the
initiation of the subsequent note. Since notes often overlap, inter-onset intervals are not the
same as note durations.
The inter-onset intervals are defined relative to a half-beat grid (eighth notes), and longer
notes may have their inter-onset intervals evenly split into several half-beat segments. Since
each inter-onset value corresponds to the same scored duration, the minutes per beat for a
note is a simple re-scaling of the note’s first inter-onset interval in the data. Inverting this
gives us the note’s tempo in terms of beats per minute.
There is a problem with the last note as it has no following note and thus no defined
inter-onset-interval. Perhaps the best thing would be to use the last note’s duration, but that
value is not available in our professional data set. Therefore we opted to simply duplicate the
preceding note's tempo value. Furthermore, we focus on the melody line, dropping other
notes (including grace notes) from consideration.
For each remaining note in a performance, we normalize its tempo (beats-per-minute) by
dividing by the beats-per-minute for the entire performance. This gives the relative speed of
performance at each note. The tempo factor feature is the logarithm of the relative speeds. We
take the logarithm so that slower-than-average and faster-than-average values are represented
on the same numeric scale.⁴ We also include the first-order differences of the tempo factors
(i.e. how much the tempo changes from one note to the next note). This is not only useful for
³ The Disklavier is a real piano outfitted with digital sensors that enable it to record performances directly to MIDI.
⁴ After taking the logarithm, a note played at the average tempo has a tempo factor of 0, one played twice as fast has a tempo factor of 1, and one played half as fast has a tempo factor of −1.
training, but is important during synthesis to produce smoother transitions between musical
contexts (see Section 6 for details). Therefore, we have the following two performance
features for each note in the melody:
1. TempoFactor: The tempo relative to the performance's average at time t, transformed
into log₂ space: log₂(Tempo_t / Tempo_avg).
2. ΔTempoFactor: The difference between this note's TempoFactor and the preceding note's
TempoFactor.
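To make the encoding concrete, the following is a minimal sketch (ours, not from the paper) of how these two features could be computed from raw inter-onset intervals; the function name and the handling of the first note's difference are assumptions.

```python
import numpy as np

def tempo_features(ioi_ms, beats_per_ioi, avg_bpm):
    """Compute TempoFactor and DeltaTempoFactor from inter-onset intervals.

    ioi_ms: inter-onset interval of each melody note in milliseconds
    beats_per_ioi: scored length of each interval in beats (0.5 on the
        half-beat grid described above)
    avg_bpm: average tempo of the whole performance in beats per minute
    """
    bpm = 60000.0 * beats_per_ioi / np.asarray(ioi_ms, dtype=float)
    tempo_factor = np.log2(bpm / avg_bpm)          # log2(Tempo_t / Tempo_avg)
    # First-order differences; the first note has no predecessor, so its
    # delta is set to 0 (an assumption -- the paper does not specify this).
    delta = np.diff(tempo_factor, prepend=tempo_factor[0])
    return tempo_factor, delta
```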
From the score we extract information on both the note and phrase levels. We currently
use four note-level features and two phrase-level features. All of the phrase data was obtained
from score analyses by two professional musicians. Although we experimented with models
representing multiple levels of phrasing, we achieved the best results using only low-level
phrase information. More elaborate hierarchical models that can take advantage of additional
phrasing levels may be preferable when more data is available. For each note in the score,
we include the following features:
1. RelativeDuration: The logarithm of the ratio of this note’s duration to that of its successor.
2. Pitch: The difference in semitones between this note and its successor.
3. MetricStrength: An integer describing the inherent metric strength of a note based on its
measure position and the current time signature. Notes starting a measure have a metric
strength of 1 (strongest) and other notes in the measure have larger values based on a
metric hierarchy (Lerdahl & Jackendoff, 1983). In practice we capped the metric strength
value so that the few notes with a metric strength weaker than 4 were treated as if they
had metric strength 4.
4. LastNote: A binary feature that is true only for the last note in the piece. In our earlier
work we found that the final notes of performances were often sustained far longer than
their scored duration, and the LastNote feature helped the model learn this pattern.
5. PhraseOffset: A real-valued number between 0 and 1 which describes the relative position
of this note with respect to the current phrase (thus the first note in the phrase has a
PhraseOffset of 0 and the last note’s PhraseOffset is 1).
6. PhraseBoundary: A discrete feature indicating if the note is: the first note of, last note of,
or internal to its phrase.
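As an illustration of the two most structured of these features, here is a small sketch (ours) that assigns MetricStrength on an eighth-note grid in 4/4 time and PhraseOffset from a note's index within its phrase; the exact hierarchy values for 4/4 are our assumption, following Lerdahl and Jackendoff's scheme as described above.

```python
def metric_strength(pos_eighths):
    # Metric strength on an eighth-note grid in 4/4 time: the downbeat is 1
    # (strongest), beat 3 is 2, beats 2 and 4 are 3, and off-beat eighths are
    # 4; weaker positions are capped at 4, as described in the text.
    hierarchy = [1, 4, 3, 4, 2, 4, 3, 4]
    return hierarchy[pos_eighths % 8]

def phrase_offset(note_idx, phrase_len):
    # Relative position of a note within its phrase: 0 for the first note,
    # 1 for the last.
    return note_idx / (phrase_len - 1)
```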
After distilling these features, each note in a score-performance pair is represented by an
8-tuple having five real-valued components and three nominal ones.
Figure 1 contains the first two bars of the melody line from Chopin’s Etude Op. 10 No. 3.
Table 2 shows the score feature values assigned to the second note (recall that the first two
features are relative to the following note).
Fig. 1 The first two bars of the melody line from Chopin’s Etude Op. 10 No. 3. The brackets above the score
give the phrase boundaries used in our experiments. In Table 2 we give the score features for the second note
of the melody (contained in the box)
Table 2 Score features computed for the second note of Etude Op. 10 No. 3

  RelativeDuration    Pitch   MetricStrength   LastNote   PhraseOffset   PhraseBoundary
  log₂(.5/.25) = 1    −1      1                0          .25            0
4. Hierarchical HMMs for expressive tempo
In this section we briefly review the standard hidden Markov model and then describe the
hierarchical Hidden Markov Model learned by the ESP system. A (non-hierarchical) hidden
Markov model (HMM) consists of a set of states Q = {s1 , . . . , sn } that includes a special end
state se , a start distribution over Q, and an output space O. Each non-end state in Q − {se }
contains an output distribution over the space of outputs O and a transition distribution over Q.
The HMM probabilistically generates sequences of outputs as follows. The start distribution
is sampled to pick the state for time-step one, call it q1 . State q1 ’s output distribution is
sampled to select the first output, o1 . We say that q1 emits output o1 . The state for time-step
two, q2 , is picked by sampling q1 ’s transition distribution. State q2 then emits the second
output o2 and picks the state for the third time-step by sampling its transition distribution.
This process continues until some qt makes a transition to the special end state, se . At that
point the process terminates. HMMs are Markovian because q_t, the state the model is in at
time t, depends on the past only through the immediately previous state q_{t−1}. HMMs are
"hidden" because all that is observed is the sequence of outputs emitted, and not which state
emitted each output.
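This generative process can be stated in a few lines. The following sketch (ours, not the paper's code) samples one output sequence from a flat HMM with an explicit end state:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(start, trans, emit, end_state):
    """Draw one output sequence from a flat HMM.

    start: (n,) start distribution over states
    trans: (n, n) matrix whose rows are per-state transition distributions
    emit: length-n list of callables; emit[i]() draws one output from state
        i's output distribution (the entry for end_state is never called)
    end_state: index of the special end state
    """
    outputs = []
    q = rng.choice(len(start), p=start)
    while q != end_state:
        outputs.append(emit[q]())                  # state q emits an output
        q = rng.choice(len(trans[q]), p=trans[q])  # sample the next state
    return outputs
```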
4.1. Hierarchical hidden Markov models
The hierarchical hidden Markov model (HHMM) was first introduced in 1998 (Fine, Singer
& Tishby, 1998). This generalization of the hidden Markov model provides a means of
explicitly modeling multi-scale structure in a data set.
Like HMMs, hierarchical hidden Markov models contain a set of states Q and an output
space O. Our HHMMs have three kinds of states: production states, abstract states, and end
states. These states are organized in a tree structure with the production states at the leaves
and the abstract states at the internal nodes. Each group of siblings also contains one special
end state and the non-end states have a transition distribution over their siblings. Restricting
transitions to siblings preserves the hierarchical structure of the model. Thus each individual
group of siblings works somewhat like a normal HMM. However, rather than terminating
the process, the special end state in each group of siblings is used to pass control up the
tree.
Furthermore, in an HHMM only the production states contain a distribution over the
output space O and can directly emit symbols. Instead of a distribution over the output
space, abstract states contain a start distribution over their children. Each time an abstract
state is entered, the HHMM samples the abstract state’s start distribution and passes control
immediately (in the same time-step) down the tree to the selected child. Control is passed
back to the parent after its child end state is entered; this allows the parent to make a transition
to one of the parent's siblings. Each entry into an abstract state is considered to generate the
entire sequence of outputs emitted by the sub-HHMM rooted at the abstract state. Figure 2
contains a diagram representing an HHMM. Since the root consists only of a start distribution
over its children, it is commonly omitted from diagrams.
Fig. 2 An example of a depth-2 HHMM (the root is omitted). White nodes represent states in the HHMM
while grey nodes represent the outputs emitted by the model
Like an HMM, the HHMM probabilistically generates a sequence of outputs by emitting
one output at each time step. At each time step the HHMM will be “in” one state at each
level and we use superscripts to denote depths in the tree (with the root at depth 0).
HHMMs can have their states organized in a directed acyclic graph rather than a tree, with
abstract states sharing sub-models. This sharing reduces the overall model size and can make
training easier since the shared structure only has to be learned once. We have not yet been
able to exploit this property and use a simple two-level tree structure in our experiments.
More elaborate hierarchies may be preferable when more training data is available.
It is important to note that for every HHMM, an equivalent “flat” HMM can be constructed.
This is done by creating a separate hidden state for each complete vertical path through the
HHMM, so that the one state q_t in the flattened HMM encodes the complete tuple of states
(q_t^1, q_t^2, . . . , q_t^D). Note that the flattened HMM has a special structure as many transitions are
impossible and some of the other transitions have constraints on their probabilities. Although
an HHMM and its flat HMM counterpart represent equivalent probability distributions over
emitted sequences, the loss of hierarchical model structure obscures any multi-scale interpretation of the data. Additionally, any shared model structure in a DAG HHMM will have
to be duplicated, increasing the overall model size.
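The flattening bookkeeping for a tree-structured two-level HHMM like ours is straightforward; in this sketch (an illustration, not ESP code), each (phrase state, note state) pair — i.e., each complete root-to-production path — becomes one flat-HMM state:

```python
import itertools

def flatten_states(n_phrase, n_note):
    # Enumerate each (phrase state, note state) pair, i.e., each complete
    # path from the root to a production state; each pair becomes one state
    # of the equivalent flat HMM.
    paths = list(itertools.product(range(n_phrase), range(n_note)))
    index = {pair: i for i, pair in enumerate(paths)}
    return paths, index

paths, index = flatten_states(5, 8)  # 40 flat states, as in the ESP model
```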
4.2. The ESP model
The ESP system uses a 2-level HHMM for modeling the data. This hierarchy is used to simultaneously learn how score structure and performance variation correlate at two levels of
abstraction. Our current architecture has a number of top-level abstract states corresponding
to abstract musical contexts at the level of musical phrases. Each of these has a number of
production state children representing note level musical contexts. Therefore, the model captures note-level performance strategies conditioned on phrase context. Furthermore, having a
separate model for each phrase-level context makes it easier to gain insight from the learned
model.
For the experiments reported in Section 7 we use five top-level states each with eight
production state children as shown in Fig. 3. We used eight production-level states because
eight is the approximate average number of notes per phrase in our data set. The number
of top-level states was chosen in an ad-hoc attempt to balance computational considerations
and the perceived quality of the synthesized performances. Given the minimal amount of
prior information used in determining the model structure, it is likely that it could be further
optimized.
Fig. 3 The basic HHMM used by ESP. There are five phrase-level states, each with eight note-level children.
Each group of siblings is ordered left-to-right and is actually fully connected (arcs have been removed to avoid
clutter). However, we initialize each state’s transition probabilities to decay exponentially with distance.
Our model captures the relationship between score and performance by using a joint
distribution over score and performance features in each production state. As discussed in
Section 3, tempo change is currently the aspect of expressive performance emphasized by the
hierarchical ESP models. (However, the ESP1 system described in Section 8 also represents
velocity.)
Each production state contains a joint distribution over the eight note-level performance
and score features described in Section 3. These joint distributions are encoded as the product
of the following five distributions.
1. Two-dimensional Gaussian distribution (with general covariance matrix) over TempoFactor and ΔTempoFactor.
2. Three-dimensional Gaussian distribution (with general covariance matrix) over RelativeDuration, Pitch, and PhraseOffset features.
3. Multinomial distribution over MetricStrength values.
4. Multinomial distribution over PhraseBoundary values.
5. Bernoulli distribution over LastNote values.
The distributions over score features reflect how applicable the state is to the various
notes in the score, and the distribution over performance features guides the tempo of the
synthesized notes.
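The following sketch shows how a production state's joint output log-probability decomposes into the product of the five distributions above; the field names are illustrative, and we assume the multinomial and Bernoulli parameters are stored as plain probability vectors:

```python
import numpy as np
from scipy.stats import multivariate_normal

def production_state_logprob(state, note):
    # Joint log-probability of a note's eight features under one production
    # state, as the product of the five distributions listed above.
    lp = multivariate_normal.logpdf(
        [note['tempo_factor'], note['delta_tempo_factor']],
        mean=state['perf_mean'], cov=state['perf_cov'])
    lp += multivariate_normal.logpdf(
        [note['rel_duration'], note['pitch'], note['phrase_offset']],
        mean=state['score_mean'], cov=state['score_cov'])
    lp += np.log(state['metric_pmf'][note['metric_strength'] - 1])
    # phrase_boundary is assumed coded 0/1/2 for first/internal/last
    lp += np.log(state['boundary_pmf'][note['phrase_boundary']])
    p_last = state['p_last_note']
    lp += np.log(p_last if note['last_note'] else 1.0 - p_last)
    return lp
```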
We use a natural initialization of transition probabilities to avoid symmetries within groups
of siblings and enforce a correspondence between the hierarchical model and the phrase structure of the score. Each group of production state siblings is linearly ordered. The first state
emits the “first” value, the last state emits the “last” value, and the other states emit the
“internal” value for the PhraseBoundary feature. The transition probabilities are initialized
to prefer nearby states in the linear order, falling off exponentially as the distance increases.
Transitions between the phrase-level nodes are initialized similarly. Finally, the last production state child of the last phrase-level state is initialized to emit the LastNote value true
while all other production states emit false. Thus the PhraseBoundary and LastNote features are deterministically emitted based on the initialization and are not learned from the
data.
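The exponential-decay initialization might look like the following sketch; the decay rate is an assumed constant, since the paper does not specify one:

```python
import numpy as np

def decaying_transitions(n, rate=0.5):
    # n-state transition matrix whose probabilities fall off exponentially
    # with distance in the linear sibling order; rate=0.5 is an assumed
    # decay constant (the paper does not give a value).
    dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    A = rate ** dist.astype(float)
    return A / A.sum(axis=1, keepdims=True)
```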
5. Model training
The goal of model training is to find model parameters θ that maximize the likelihood that
the model generates the observed data. The first step in training is to flatten the HHMM
into a (non-hierarchical) HMM. We can then use the Expectation-Maximization (EM) algorithm (Dempster, Laird & Rubin, 1977) to train the flattened HHMM from a set of performances. In the Expectation step the EM algorithm estimates, for each note and production
state pair, the posterior probability (given the current model parameters) that the HHMM
used that production state to emit the note’s features. The maximization step uses these estimates to revise the model parameters in order to increase the likelihood of generating the
data sequence. There are several good surveys describing the EM algorithm and how it can be
applied to learning HMM models (Bengio, 1999; Bilmes, 1997). We follow Brand in using
an entropic estimator (Brand, 1999b) that encourages sparseness when revising the model
parameters.
5.1. Computing the expected sufficient statistics
The Expectation step of EM requires us to compute the occupancy distributions, the probabilities of being in each state of the HMM at each time-step of each performance in the
training data. This process is referred to as inference. Murphy and Paskin propose a fast
method for exact inference in HHMMs based on treating the HHMM as a dynamic Bayesian
network (Murphy & Paskin, 2002). As more data becomes available and the HHMM models
become more elaborate, the more powerful dynamic Bayesian network techniques are likely
to become increasingly important. However, we found that flattening the HHMM and using
the standard forward-backward algorithm (also known as the Baum-Welch algorithm) on the
flattened HMM was sufficient for our (2-level) models.
Recall from Section 4.1 that the flat HMM corresponding to an HHMM contains one
HMM state for each tuple of states in the HHMM that can be active at the same time (each
path from the root to a production state). In our HHMM model we have n_p = 5 phrase-level
parent states, each with n_c = 8 note-level children, so the corresponding flat HMM
has n_p·n_c = 40 states. This means that the standard forward-backward algorithm can compute the
needed probabilities in time O(n_p² n_c² T), where T is the total number of notes in the training
data.
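For reference, here is a compact scaled forward-backward pass over the flattened HMM; this is a generic textbook implementation, not code from ESP:

```python
import numpy as np

def occupancy_probs(start, trans, loglik):
    """Scaled forward-backward pass over a flat HMM.

    start: (n,) start distribution; trans: (n, n) transition matrix;
    loglik: (T, n) per-note output log-likelihoods log P(o_t | state i).
    Returns gamma: (T, n) posterior state occupancy probabilities.
    """
    T, n = loglik.shape
    B = np.exp(loglik - loglik.max(axis=1, keepdims=True))  # scaled emissions
    alpha = np.zeros((T, n))
    beta = np.zeros((T, n))
    alpha[0] = start * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                    # forward pass, rescaled per step
        alpha[t] = (alpha[t - 1] @ trans) * B[t]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):           # backward pass, rescaled per step
        beta[t] = trans @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta                     # per-step scalings cancel below
    return gamma / gamma.sum(axis=1, keepdims=True)
```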
Once it has computed the occupancy distributions, the standard EM algorithm adjusts the
transition probabilities and output distributions at each node to maximize the likelihood with
respect to the previously computed state occupancy probabilities (Bengio, 1999; Bilmes,
1997). Instead of performing this straightforward optimization we incorporate an entropic
prior favoring sparseness in the transitions.
5.2. The entropic MAP estimator
We use a modified version of the Expectation-Maximization algorithm to train the multinomial state transition parameters in our model. This replacement for the maximization step
of EM is called the entropic estimator (Brand, 1999b). The entropic estimator incorporates
the belief that sparse models (those whose parameter distributions have low entropy) are
generally preferable to complex models. This bias is motivated by the fact that sparse models
tend to be more resistant to the effects of noise and over-fitting, making them better representations of the underlying structure present in the data. We can incorporate this belief into
a prior distribution on each set of multinomial parameters, θ, using the entropic prior:

  P_e(θ) ∝ e^{−H(θ)} = e^{∑_i θ_i log θ_i} = ∏_i θ_i^{θ_i} = θ^θ        (1)

where H(θ) = −∑_i θ_i log θ_i is the Shannon entropy of the multinomial distribution θ. While
classic EM tries to maximize the likelihood of the model parameters θ given the data X ,
L(θ | X ), the entropic estimator tries to find the maximum a posteriori distribution formed
by combining the model likelihood with the entropic prior. We can formulate this posterior
using Bayes’ rule as follows:
P(θ | X ) =
Pe (θ )L(θ | X )
P(X )
(2)
=
e−H (θ ) L(θ | X )
P(X )
(3)
∝ e−H (θ ) L(θ | X )
(4)
The proportionality in Eq. (4) is justified because P(X ) does not depend on θ and is therefore
a constant. It is often more convenient to work with the log-posterior:
  log P(θ | X) = log P_e(θ) + log L(θ | X) − log P(X)        (5)
               = −H(θ) + log L(θ | X) − log P(X)             (6)
The relative importance of the prior and the data likelihood can be adjusted with a trade-off
parameter η. Furthermore, since P(X ) is independent of θ , maximizing the log-posterior
(with the trade-off factor) is equivalent to maximizing the following entropic estimator:
−ηH (θ) + log L(θ | X ).
(7)
When η = 0, the entropic estimator becomes equivalent to EM. An η < 0 encourages maximum structure by rewarding entropy in an effort to model the data as fully as possible. This
setting is useful in the early stages of an annealing schedule (discussed below in Section 5.4).
When η > 0 (specifically when η = 1, see (Brand, 1999b) for details) minimum entropy solutions are encouraged and model complexity is punished. Varying η during training, typically
from η < 0 up to η = 1, can help to keep the model parameters from getting stuck in local
minima (again, see Section 5.4 for details).
Deriving the entropic estimator is somewhat involved, so we refer the reader to Brand
(1999b) for details. Given the expected sufficient statistics (empirical probabilities of the
multinomial outcomes), ω, the entropic estimator θ̂ for a multinomial distribution θ is:

  θ̂_i = −ω_i / ( η W( −(ω_i / η) e^{1+λ} ) )        (8)

where W is the Lambert W function (Corless et al., 1996) and λ is a Lagrange multiplier
ensuring that the θ̂_i sum to one. There is a fast iterative method for simultaneously solving
for λ and the θ̂_i (Brand, 1999b).
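Rather than reproduce Brand's Lambert W fixed point, the following sketch recovers the same MAP estimate by direct numerical maximization of the criterion in Eq. (7) over the simplex; it is a brute-force stand-in meant only to illustrate the effect of η:

```python
import numpy as np
from scipy.optimize import minimize

def entropic_map(omega, eta):
    # Maximize -eta*H(theta) + sum_i omega_i*log(theta_i) over the simplex,
    # where omega are the expected sufficient statistics.
    omega = np.asarray(omega, dtype=float)
    def neg_obj(th):
        th = np.clip(th, 1e-12, 1.0)
        return -(np.sum(omega * np.log(th)) + eta * np.sum(th * np.log(th)))
    res = minimize(neg_obj, omega / omega.sum(),
                   bounds=[(1e-12, 1.0)] * len(omega),
                   constraints={'type': 'eq', 'fun': lambda th: th.sum() - 1})
    return res.x

counts = [6.0, 3.0, 1.0]
print(entropic_map(counts, eta=0.0))  # ~[0.6, 0.3, 0.1]: plain ML/EM estimate
print(entropic_map(counts, eta=1.0))  # mass shifts toward the larger outcomes
```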
5.3. Parameter trimming
As with most learning algorithms, graphical models can benefit tremendously from prior
knowledge. Significant increases in model quality can often be achieved by encoding information about the problem domain into the model structure.⁵ In our case, we can impose
some structural constraints (like forcing the start and end of phrases in certain states), but
otherwise we are ignorant about how the model should be structured. This leaves us with little
choice but to fully connect both the production states under each phrase-level abstract state
as well as fully connect the phrase-level abstract states themselves. This is an undesirable
situation when training with EM. Because EM only tries to maximize data likelihood, it will
make use of the full dynamic range of the state space to model noise as well as other aspects
of the data that are not statistically well supported. Thus, fully connected models (with an
adequate number of states) that have been trained using EM are likely to overfit, be overly
complicated, and generalize poorly.
The entropic estimator addresses these issues with its bias towards “informative” parameter
settings. However, it also provides a way to roll structure learning into the training process.
By introducing trimming criteria into the training cycle, we can begin with an overly-general
(i.e. fully connected) model and let the data determine an appropriate structure.
Trimming works by exploiting the fact that we are working to increase the log-posterior of
the model (Eq. (7)) rather than just the log-likelihood of the training data alone. Therefore, we
can afford to remove a parameter (possible transition) if the reduction of entropy outweighs the
decrease in log-likelihood. Using the notation A_{i→k} for the probability that state i transitions
to state k, A_{i→∗} for the vector of transition probabilities out of state i, and A_{i→∖k} for the
vector of transition probabilities out of state i to any state other than k, the trimming
criterion used in ESP for transition parameters is:

  η ( H(A_{i→∗}) − H(A_{i→∖k}) ) > log L(A_{i→∗} | X) − log L(A_{i→∖k} | X)        (9)

Inequality (9) compares the reduction in entropy to the reduction in log-likelihood (with trade-off parameter η) when the transition from state i to state k is removed.
⁵ Because EM can never revive a parameter once it has been zeroed out, setting initial transition or observation probabilities to 0 provides model structure.
Following Brand (1999b), we approximate the right-hand side of Eq. (9) as follows:

  log L(A_{i→∗} | X) − log L(A_{i→∖k} | X) ≈ A_{i→k} ∂log L(A_{i→∗} | X) / ∂A_{i→k}        (10)

Simplifying the left side of Eq. (9) and using Approximation (10) on the right gives the following trim test:

  −η A_{i→k} log A_{i→k} > A_{i→k} ∂log L(A_{i→∗} | X) / ∂A_{i→k}        (11)
  log A_{i→k} < −(1/η) ∂log L(A_{i→∗} | X) / ∂A_{i→k}                    (12)
  A_{i→k} < exp( −(1/η) ∂log L(A_{i→∗} | X) / ∂A_{i→k} )                 (13)

The inequality in Eq. (13) provides a cheap test for trimming because the gradient is easily
calculated from the state occupancy probabilities computed in the E-step. If the inequality
in Eq. (13) holds, then removing the transition from state i to state k (setting the transition
probability to 0) will increase the optimization criterion in Eq. (7) (modulo the approximation).
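In code, the trim test reduces to a single vectorized comparison. In this sketch (ours), the gradient uses the fact that, for a multinomial row, ∂log L/∂A_{i→k} equals the expected transition count divided by A_{i→k}:

```python
import numpy as np

def trim_transitions(A, counts, eta):
    # A: (n, n) current transition matrix; counts: (n, n) expected transition
    # counts from the E-step; eta is assumed positive (trimming is only
    # enabled late in the annealing schedule, per Section 7.1).
    grad = np.where(A > 0, counts / np.maximum(A, 1e-300), 0.0)
    trim = (A > 0) & (A < np.exp(-grad / eta))   # the test of Eq. (13)
    A = np.where(trim, 0.0, A)
    # Renormalize each row (heavily used transitions survive the test, so we
    # assume every row keeps at least one nonzero entry).
    return A / A.sum(axis=1, keepdims=True)
```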
5.4. Annealing the entropic prior
As mentioned above, traditional EM suffers from the tendency to get stuck in local maxima
of the probability space defined by the model parameters.⁶ The degree to which this happens
is largely a function of the initial parameter settings we assign to the model.
The intuition behind annealing is that at high “temperatures” (negative values of η in the
case of the entropic MAP estimator), we are smoothing the energy surface of the function
that we are trying to maximize. This gives the algorithm a better chance of getting into the
right “neighborhood” of the function’s global maximum.
We can anneal the posterior distribution (Eq. (7)), which the entropic MAP estimator tries
to maximize, by changing the weight of the entropic prior. To do this we vary η continuously
over the course of all training iterations, typically starting at around −9 and ending at 1.
There is little formal theory to guide our choice of “cooling” schedule for η. However, as
mentioned previously, η = 0 and η = 1, are special cases which correspond to EM and the
minimum entropy estimator, respectively. Generally, we begin training with η < 0 so that
the training process begins by exploring many different parameter settings. Depending on
our goals for training, we might choose to stop at η = 0 (if we were more concerned with
model fit than generalization) instead of allowing the algorithm to seek the minimum entropy
solution.
During the first few iterations of training, we want to give the algorithm a chance to explore
the target function’s energy surface at a larger scale. Then, as we reduce the temperature,
more detail comes into view, allowing us to settle into a (hopefully) global maximum.
⁶ This space is also referred to as the energy surface, a term born out of simulated annealing's origins in physics. We refer to both in our discussions below.
Table 3 Comparison of models produced by training with EM versus those produced by training with the
entropic prior. Here ε is the smallest positive value represented by the system (MATLAB). Eight models were
trained using each training algorithm (with identical starting conditions) and the results were averaged

                                EM        Entropic
  Log-likelihood                −7336     −7174
  Perplexity (thresh = .001)    7.4965    6.9427
  Perplexity (thresh = 2ε)      11.3073   7.7587
5.5. The benefits of the entropic prior
Brand has conducted several experiments comparing the entropic estimator to EM, with a
focus largely on the former's advantages in terms of model complexity (Brand, 1999a; Brand,
1999b). We performed a quick test to validate the use of the entropic prior with trimming and
annealing for our system. Empirically, we find that entropic estimation produces better models
than EM, both in terms of log-likelihood and model complexity. We trained using both EM
and the annealed entropic estimator with parameter trimming on the entire student data set
and measured the perplexity (average number of transitions out of each state) and data fit of
the resulting models. Using the same random initializations, we trained 8 times with each
technique and averaged the results. Because some state transitions may have very small
probabilities, we report perplexity results using two different threshold values (the threshold
is defined as the minimum probability that a parameter must have in order to be taken
into account when computing perplexity). The results given in Table 3 show that entropic
estimation not only produces sparser models, but also ones tending to fit the data better.
Surprising as it may seem, the entropic estimator tends to produce superior models. EM
techniques are subject to local minima and can set parameters to zero too early in the optimization process. The annealing schedule helps preserve parameters and smooth the likelihood
surface early in the optimization process. The value of η becomes positive after a good region is found (hopefully). This has the effect of encouraging model sparsity and reducing
over-fitting.
6. Synthesis
After training, the ESP system uses the HHMM model to synthesize renditions of new scores.
More precisely, the system creates a sequence of performance features corresponding to the
melody notes in the score. We first find a candidate sequence of performance features f^p,
and then iteratively update f^p using the Expectation-Maximization (EM) algorithm.
6.1. Generating performance features
Let M be the learned model, T be the number of melody notes in the new score, and f^s the
sequence of score features corresponding to the new score (created in the same way as the
score features were created from the training score). It is easiest to describe performance
synthesis in terms of the flattened HHMM, so the following treats the model M as a
(non-hierarchical) hidden Markov model.
To create the initial sequence of performance features, the system finds the most likely
state sequence q = q_1, . . . , q_T for the score sequence f^s in the trained model M. Thus
state sequence q maximizes the probability P(q, f^s | M), ignoring the emitted performance
features, and is a kind of Viterbi path through M. We then take each state q_t in q and sample
from its performance feature output distribution P(f^p | q_t) to get the initial performance
features for time t. In principle, we can initialize f^p arbitrarily, but this process gives the
EM algorithm a good starting point, speeding convergence and reducing problems with local
minima. Note that path q need not be the path through M that maximizes P(f^p, f^s | q, M),
so further improvement to the synthesized performance is possible using the EM algorithm.
The Expectation step computes the occupancy probabilities λ_t(i) for every state s_i in
M and time 1 ≤ t ≤ T. Each λ_t(i) is the probability that the model is in state i at time t,
conditioned on the sequence of outputs, and is easily computed with the forward-backward
algorithm.
From the derivation in Wang et al. (2003) we get the following Maximization-step formula
for the f_t^p feature values:

  f_t^p = ( ∑_i λ_t(i) (Σ_i^P)^{−1} )^{−1} ( ∑_i λ_t(i) (Σ_i^P)^{−1} μ_i^P )        (14)

where μ_i^P and Σ_i^P represent the mean and covariance matrix of the performance feature
distribution in state i of the model.
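A direct transcription of Eq. (14); the occupancy array λ and the per-state Gaussian parameters are assumed to be given:

```python
import numpy as np

def m_step_features(lam, mu, cov):
    # lam: (T, N) occupancy probabilities lambda_t(i)
    # mu:  (N, D) performance-feature means mu_i^P
    # cov: (N, D, D) performance-feature covariances Sigma_i^P
    T, N = lam.shape
    D = mu.shape[1]
    prec = np.linalg.inv(cov)                 # per-state precision matrices
    f = np.zeros((T, D))
    for t in range(T):
        A = np.einsum('i,ijk->jk', lam[t], prec)        # sum_i lam * prec
        b = np.einsum('i,ijk,ik->j', lam[t], prec, mu)  # sum_i lam*prec*mu
        f[t] = np.linalg.solve(A, b)
    return f
```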
6.2. Feature smoothing
Recall from Section 3 that there are two real-valued performance features for each note: the
TempoFactor, reflecting the relative tempo at the note, and the ΔTempoFactor, measuring the
change in relative tempo at the note. Since the human ear easily picks up changes in tempo
(and volume), neither feature alone is sufficient to produce good quality renditions. If only
the TempoFactor is used, then the occupancy probabilities at time t can favor states with
slow TempoFactors, while the occupancy probabilities at time t + 1 can favor states with
faster TempoFactors. This causes erratic and uncomfortably abrupt tempo changes in the
synthesized performance. On the other hand, using only the ΔTempoFactor feature results
in a different kind of problem. When the occupancy probabilities for several consecutive
notes all emphasize the states with high (low) ΔTempoFactors, the cumulative effect
makes the synthesized performance unrealistically fast (slow). Our solution is to combine
these two seemingly redundant features in order to produce smooth and reasonable tempo
variations.
Assume that we are about to synthesize the tempo for a particular note using the tempo at
the previous note and the distribution over TempoFactor and ΔTempoFactor in a particular
production state. The natural approach is to assign the note a tempo such that the joint
probability of the resulting TempoFactor and ΔTempoFactor is maximized. This enforces a
compromise between the redundant (and perhaps conflicting) features, with the covariance
matrix of the distribution influencing the degree to which each feature compromises.
The situation when synthesizing a full performance is more complicated, since the previous
note's tempo can be adjusted if doing so improves the overall fit for the whole piece. As described
by Wang et al. (2003), the second-order features (ΔTempoFactors in our case) can be expanded into
differences of first-order features just before taking derivatives of the auxiliary Q function in
the EM algorithm. This results in a system of linear equations whose solution optimizes the
first-order features with respect to the second-order information.
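A sketch of that linear system for our two tempo features, under the simplifying assumption that each note has already been assigned a single Gaussian over (TempoFactor, ΔTempoFactor) (in the full system these would be occupancy-weighted mixtures, per Eq. (14)):

```python
import numpy as np

def smooth_tempo_curve(mu, prec):
    """Solve for the tempo-factor sequence x maximizing the summed Gaussian
    log-probability of v_t = [x_t, x_t - x_{t-1}] under per-note means mu[t]
    (length 2) and precision matrices prec[t] (2x2).  The objective is
    quadratic in x, so the maximizer solves the banded system A x = b."""
    T = len(mu)
    A = np.zeros((T, T))
    b = np.zeros(T)
    G = np.array([[0.0, 1.0],    # TempoFactor row:       v = x_t
                  [-1.0, 1.0]])  # DeltaTempoFactor row:  v = x_t - x_{t-1}
    for t in range(T):
        P, m = prec[t], mu[t]
        if t == 0:
            g = np.array([1.0, 1.0])  # assume x_{-1} = 0 (the average tempo)
            A[0, 0] += g @ P @ g
            b[0] += g @ P @ m
        else:
            idx = [t - 1, t]          # v_t couples x_{t-1} and x_t
            A[np.ix_(idx, idx)] += G.T @ P @ G
            b[idx] += G.T @ P @ m
    return np.linalg.solve(A, b)
```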
Fig. 4 Score of the melody line adapted from Chopin’s Etude Op. 10 No. 3 and used as test data for our
synthesis experiment
7. Experiments
This section describes our experiments with the ESP system. All experiments were implemented in MATLAB and used the BayesNet (Murphy, 2004) toolbox for some of the machine
learning algorithms. Raw experimental results will be available for download at:
http://www.media.mit.edu/~grindlay/ESP/ESP.html.
7.1. Synthesizing performances of novel scores
In our first experiment, we compare the results of the ESP synthesis algorithm to real performances. We trained an HHMM on the students’ performances of Träumerei and Prelude
No. 15 in D♭ eight separate times. Each model, M_i, had 5 phrase-level states with 8 note-level
states in each phrase sub-model, giving a total of 40 production states. The start distributions
of both the phrase and note-level models were uniform. Transition probabilities were set to fall
exponentially off of the diagonals and emission probabilities were initialized randomly (with
the exception of the phrase distributions which use the fixed values described in Section 4).
The entropic prior was annealed during the training process according to a cooling schedule
that increased from −9 to 1 geometrically. We allowed transition parameter trimming on all
levels once the temperature rose above 0.1.
Once trained, we used the score of Etude Op. 10 No. 3 (see Fig. 4) to render a synthetic
tempo-factor curve, f_i^synth, from each model M_i, according to the algorithms described in
Section 6. We report on synthetic renderings of the Etude because its notes' features are
well represented in the other scores. In contrast, both Träumerei and Prelude contain notes
whose score features do not appear in the other two pieces. The resulting tempo-factor curves
were then averaged,⁷ yielding f_avg^synth. Next, we computed the average of the 7 tempo-factor
curves extracted from the student performances of Etude Op. 10 No. 3, yielding f_avg^stud. We
then computed the correlation, mean square error (MSE), and mean absolute error (MAE)
of f_avg^synth relative to f_avg^stud. We report these three measures since it is unclear how one can
reasonably quantify the perceptual similarity of performances. In order to provide a basis
for comparison, we also computed the MSE and MAE of a flat tempo-factor curve, f^flat,
relative to f_avg^stud. This flat curve represents a performance with no variations in tempo.
⁷ The individual tempo-factor curves showed some variation, more than one would expect from the similar likelihoods of the trained models. This leads us to suspect that we did not have enough data to fully overcome the random initialization of emission probabilities, and thus averaging several trained models is a reasonable approach. It is likely that better results could be achieved if either more training data were available or the emission probabilities were initialized to musically sensible values.
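For completeness, the three comparison measures can be computed with a small helper (ours):

```python
import numpy as np

def compare_curves(f_a, f_b):
    # MAE, MSE, and Pearson correlation between two tempo-factor curves.
    f_a, f_b = np.asarray(f_a, dtype=float), np.asarray(f_b, dtype=float)
    mae = np.mean(np.abs(f_a - f_b))
    mse = np.mean((f_a - f_b) ** 2)
    corr = np.corrcoef(f_a, f_b)[0, 1]
    return mae, mse, corr
```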
Table 4 Comparison of synthesized and flat tempo curves relative to human performance

                                MAE      MSE      Correlation
  f_avg^synth vs. f_avg^stud    .1265    .0200    .5045
  f^flat vs. f_avg^stud         .1060    .0169    −
Fig. 5 The average synthetically generated tempo-factor curve and the performances of 7 student performers
(y-axis: tempo factor; x-axis: notes). The vertical lines denote phrase boundaries
Figure 5 shows the synthetic tempo-factor curve along with the tempo-factor curves of
the 7 student performers. As indicated in Table 4, the average synthetic curve was reasonably
correlated with the average student performance, but did not perform as well as the flat
curve under the average and squared error metrics. This is likely the result of our relatively
small training set, although differences in student performance style may have played a
part as well. Despite the somewhat disappointing error levels, the synthetic curve does have
some interesting properties. Perhaps most importantly, the synthetic curve follows the phrase
structure in the score. Other researchers have shown that the tempo curves of individual
phrases can generally be well described by second-order polynomials (Todd, 1992; Repp,
1992). From Fig. 5 we can see that the synthetic tempo curve contains (roughly) concave
subsections that align nicely with the phrasing of the score. These phrase shapes agree with
the general trends of the student performances.
It is worth noting that other researchers using nearest neighbor techniques found expressive
piano tempo much harder to synthesize than dynamics (Tobudic & Widmer, 2003). Their
methods, experimental setup, and data (performances of Mozart piano sonatas) are very
different from ours, making a direct fair comparison difficult. Although they had excellent
results with dynamics, many versions of their algorithm often performed worse than a constant
tempo synthesis. They report correlation factors ranging from 0.17 to 0.81 (with averages for
different algorithms ranging from 0.28 to 0.38). The correlation factor between the tempo
synthesized by ESP and the average of the student performances is greater than 0.5.
Table 5 Comparison of HHMM and HMM synthesized tempo curves relative to human performance

                            MAE      MSE      Correlation
  f^HHMM vs. f_avg^stud     .1265    .0200    .5045
  f^HMM vs. f_avg^stud      .2119    .0616    .0739
7.2. Comparison of HHMMs and HMMs
There are several reasons why a hierarchical approach to performance modeling is beneficial. Aside from potential advantages in examining trained models for musicological insight, it is our contention that HHMMs are better suited to the synthesis task than are flat
HMMs. We tested this belief by directly comparing the synthesized tempo curves produced
by a flat HMM with those produced by the HHMM described above in Section 7.1. We
used the same initial starting conditions and the same number of total states in both models. We trained eight HMMs on the same student data used with the HHMM, synthesized
eight tempo curves of Etude Op. 10 No. 3, and then averaged them. The resulting tempo
curve was then compared with the average student performance in the same manner as described in Section 7.1. A comparison of the HMM and HHMM average results is given in
Table 5.
We can see that not only are the MAE and MSE values significantly lower for the HHMM-generated curve, but the HHMM-generated curve also has a much higher correlation coefficient
than the curve generated by the HMM. This provides evidence that the hierarchical form
allows the model to better capture the multiscale structure present in the tempo and score
data.
7.3. Recognizing skill (or lack thereof): Professional vs. student pianists
It is generally accepted that expert performers exhibit a greater degree of stylistic individuality
than less-skilled performers do. However, there is also evidence that, despite individual
differences, the average performance profile of professional performers is quite similar to
that of student performers (Repp, 1997). This suggests that a common performance standard
exists for performers of varying skill level.
One of the advantages of probabilistic models, such as ESP, is that they can provide
quantitative answers to questions of data and model similarity. In this experiment, we sought
to compare student and professional performances both individually and as groups. We
trained a separate model, $\theta_i^{\mathrm{prof}}$, on each of the 28 professional performances of Träumerei, $f_i^{\mathrm{prof}}$, and a separate model, $\theta_i^{\mathrm{stud}}$, on each of the 10 student performances of Träumerei, $f_i^{\mathrm{stud}}$. We then computed separate log-likelihood values for each performance under each of the 38 models. Finally, we computed the means and standard deviations of the average log-likelihoods of performances under each model. This was done for each combination of model and performance class:

$$LL\left(\theta_i^{\mathrm{prof}} \mid f^{\mathrm{prof}}\right) = \frac{1}{28}\sum_{j=1}^{28} \log P\left(f_j^{\mathrm{prof}} \mid \theta_i^{\mathrm{prof}}\right)$$

$$LL\left(\theta_i^{\mathrm{prof}} \mid f^{\mathrm{stud}}\right) = \frac{1}{10}\sum_{j=1}^{10} \log P\left(f_j^{\mathrm{stud}} \mid \theta_i^{\mathrm{prof}}\right)$$
$$LL\left(\theta_i^{\mathrm{stud}} \mid f^{\mathrm{stud}}\right) = \frac{1}{10}\sum_{j=1}^{10} \log P\left(f_j^{\mathrm{stud}} \mid \theta_i^{\mathrm{stud}}\right)$$

$$LL\left(\theta_i^{\mathrm{stud}} \mid f^{\mathrm{prof}}\right) = \frac{1}{28}\sum_{j=1}^{28} \log P\left(f_j^{\mathrm{prof}} \mid \theta_i^{\mathrm{stud}}\right)$$
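The class-level statistics then follow by averaging each model's log-likelihoods over a performance class and summarizing across the models in that class. A minimal numpy sketch, assuming the per-model log-likelihoods are collected in a 38 × 38 array ordered with the 28 professionals before the 10 students (the array name and layout are our assumptions):

```python
import numpy as np

# A minimal sketch of the Table 6 statistics, assuming `loglik` is a
# 38 x 38 array with rows = models and columns = performances, where
# loglik[i, j] = log P(f_j | theta_i); professionals come first.

def summary_stats(loglik, n_prof=28):
    blocks = {
        "LL(theta_prof | f_prof)": loglik[:n_prof, :n_prof],
        "LL(theta_prof | f_stud)": loglik[:n_prof, n_prof:],
        "LL(theta_stud | f_stud)": loglik[n_prof:, n_prof:],
        "LL(theta_stud | f_prof)": loglik[n_prof:, :n_prof],
    }
    # Average over performances for each model (rows), then report the
    # mean and sample standard deviation across models in each class.
    return {name: (b.mean(axis=1).mean(), b.mean(axis=1).std(ddof=1))
            for name, b in blocks.items()}
```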
The summary statistics are given in Table 6. Figure 6 shows the log-likelihoods under the student models, while Fig. 7 shows the log-likelihoods under the professional models.
As can be seen in Table 6 and Figs. 6 and 7, not only are the standard deviations significantly
higher under the professional models, but the mean log-likelihood of professional performances under professional models is actually lower than it is for professional performances
under student models. Additionally, the mean log-likelihood of student performances under
student models is significantly higher than the mean of professional performances under professional models. Collectively, these results support the claim that professional performance
style varies significantly more than does student performance style.
Table 6 Summary statistics for the log-likelihood values of student and professional performances under student and professional models

             LL(θ_i^prof | f^prof)   LL(θ_i^prof | f^stud)   LL(θ_i^stud | f^stud)   LL(θ_i^stud | f^prof)
Mean         −437.8122               −358.5902               −295.8087               −421.4574
Std. dev.    51.0898                 63.0138                 35.3700                 39.5231
[Histogram: counts over log-likelihood (−500 to −200), professionals and students plotted separately]
Fig. 6 Histogram of the averaged log-likelihood values for professional and student performances under the 10 student models
[Histogram: counts over log-likelihood (−550 to −200), professionals and students plotted separately]
Fig. 7 Histogram of the averaged log-likelihood values for professional and student performances under the 28 professional models
Although these summaries help us to understand the general differences in student and professional model variance, they do not provide information about the global relationships between individual performers. In an effort to gain some insight into these relationships,
we formed a matrix of log-likelihood values for each of the 38 performances (columns)
under each of the 38 models (rows). We then zeroed the diagonal by subtracting the diagonal
elements from their corresponding rows and then added the resulting matrix to its transpose.
The resulting matrix M gives a dissimilarity measure on performances. We performed a
non-metric multidimensional scaling of M down to two dimensions. The result is shown in
Fig. 8.
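The sketch below illustrates this construction under the same assumed 38 × 38 log-likelihood layout as before. Negating the symmetrized matrix so that larger entries mean greater dissimilarity is our convention, and scikit-learn's MDS merely stands in for whatever implementation was actually used:

```python
import numpy as np
from sklearn.manifold import MDS

# A minimal sketch, assuming `loglik` is the 38 x 38 array of
# log-likelihoods with loglik[i, j] = log P(performance j | model i).

def embed_performances(loglik, seed=0):
    # Zero the diagonal by subtracting each row's diagonal element, so
    # entry (i, j) measures how much worse performance j fits model i
    # than model i's own performance does.
    M = loglik - np.diag(loglik)[:, None]
    # Symmetrize; negation yields nonnegative dissimilarities.
    D = -(M + M.T)
    mds = MDS(n_components=2, metric=False,
              dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(D)  # one 2D point per performance model
```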
Figure 8 provides an interesting view of how the performers cluster. Again we observe a
significantly larger model variance for the professionals than for the students. We can also
see that the student models cluster fairly close to the centroid of the professional models,
suggesting that the student performances are very close to the average of the professional
performances. This agrees with the findings of Repp (1997).
7.4. Recognizing individual performers
7.4.1. Professionals
A related, but perhaps more challenging, problem than performer skill classification is the
recognition of individual performers. Recall that our professional data set contains three
performances each by Alfred Cortot and by Vladimir Horowitz. To test whether ESP
was capable of recognizing an unseen performance of Träumerei by a specific pianist, we
conducted two separate experiments, one for Cortot and one for Horowitz. In each experiment,
[Scatter plot: 2D MDS embedding of the 38 performance models; students and professionals marked separately, with the three Cortot and three Horowitz models labeled]
Fig. 8 A 2D multidimensional scaling of the symmetric differences between individual performance models
we trained a model on two of the performer’s performances and used the third performance
as the test piece. Then we computed the log-likelihood of the test performance as well as
the performances of all other professional performers. Each experiment was three-fold cross-validated by running it three times, once for each choice of held-out test performance.
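A sketch of one fold of this protocol is given below, with hmmlearn's GaussianHMM standing in for the ESP model (ESP actually uses an HHMM trained with entropic estimation, not a plain Gaussian HMM); the data-handling names are hypothetical:

```python
import numpy as np
from hmmlearn import hmm

# One cross-validation fold of the recognition protocol.
# `train_curves` holds two of the target performer's tempo curves;
# `candidates` maps labels to the curves being scored, including the
# held-out take. Curves are 1-D arrays of per-note tempo factors.

def rank_candidates(train_curves, candidates, n_states=8, seed=0):
    # Train on two of the performer's performances...
    X = np.concatenate([c.reshape(-1, 1) for c in train_curves])
    lengths = [len(c) for c in train_curves]
    model = hmm.GaussianHMM(n_components=n_states, random_state=seed)
    model.fit(X, lengths)
    # ...then score every candidate performance under the model.
    scores = {name: model.score(c.reshape(-1, 1))
              for name, c in candidates.items()}
    # The test performance is "recognized" when it tops this ranking.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```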
For the Cortot recognition experiment, the results were quite good. The model succeeded
in recognizing (assigning the highest log-likelihood to) the Cortot test performance in all
three cases. This is not surprising given that Cortot’s models occupy a fairly isolated area of
the dissimilarity space in Fig. 8.
The results of the Horowitz recognition experiment were somewhat mixed. In this case,
the model correctly recognized the test performance in one of the three cases, assigned
the third highest score in another of the cases, and assigned the fifteenth highest score in
the third case. Again, looking at Fig. 8, this is not terribly surprising. Although two of the
Horowitz performance models appear to be very similar, they are also similar to several other
professional performance models. While Horowitz’s third performance model is relatively
isolated from the other performers' performance models, it is also fairly dissimilar to his
other two performances.
7.4.2. Students
Given that our results up to this point generally seem to indicate that students tend to have
more similar performance styles than professionals do, it seems reasonable to expect that
student performer recognition would be a more difficult task than professional performer
recognition. We tested this idea by running the same set of experiments as described in
Section 7.4.1, but on our set of student performances of Prelude No. 15 in D♭. Because two of the seven performers in this set, ALT and PYL, had each recorded three usable takes of Prelude No. 15 in D♭, we used them in the experiment. As before, two separate experiments were run, one for each performer. Each experiment was three-fold cross-validated
by running it three times, once with each of the performance takes rotated in as the test
performance.
The results of the student performer identification task were very good. The model correctly identified ALT’s test performance in each of the three experimental runs. Interestingly,
in all three runs, ALT’s third performance received higher log-likelihood scores than ALT’s
other two performances, even when it was the test performance.
PYL’s identification results were also good. ESP correctly identified the performer in one
of the three experimental runs. In one of the other runs, the test performance received the
second highest score (and was within 3% of the highest log-likelihood). The third run was
particularly interesting because not only did the test performance receive the second highest score, but
the top-ranked performance actually scored higher than one of the performances on which
the model was trained.
There are several possible reasons why our student recognition results were better than
one might have expected. First, the set of performances being scored was much smaller
than in the professional performer recognition experiment (16 − 2 = 14 performances in the
student recognition experiment versus 28 + 10 − 2 = 36 performances in the professional
recognition experiment). Second, Prelude No. 15 in D♭ may have more interpretative “room”
than Träumerei. Finally, not only were the student performances recorded over a short period
of time (unlike the professional performances), but the relative lack of individual style might
have the effect of reducing the variation between their performances.
8. Synthesizing tempo and dynamics with the original ESP system
The ESP tempo system described in the previous sections is actually the second incarnation
of the ESP system. Before we obtained the professional data set we built an earlier version of
ESP specifically for the richer student data set. To avoid confusion we will call the previously
described system ESP and the version described in this section ESP1. Since the student data set
contains velocity (loudness) information, the ESP1 system uses (and synthesizes) additional
performance features. We conducted a listening test with the ESP1 system to compare its
synthesized performances with the melody line produced by student performers. Here we
briefly describe how ESP1 differs from ESP and the results of our experiment. See Grindlay’s
thesis (Grindlay, 2005) for further details on the ESP1 system.
8.1. The ESP1 system
ESP1 uses standard HMMs rather than hierarchical HMMs. The features used by ESP1
are similar in spirit to those described in Section 3. However, the additional detail in the
student data enabled us to use first- and second-order velocity features as well as additional
timing features.
We also tried pitch and articulation score features and articulation performance features,
but they did not seem to improve the quality of the synthesized performances. With the
limited size of the student data set, increasing the number of features diluted the model's ability to generalize.
8.2. Listening test
To test the ESP1 system we performed the following informal listening test. An HMM was
trained on 10 performances in the student data set of Schumann’s Träumerei. This HMM
Table 7 Summary results of an informal listening test in which 14 undergraduate music students ranked three versions of 10 bars (melody only) from Chopin's Etude Op. 10 No. 3. Each student heard an inexpressive version rendered literally from the score, ESP1's expressive synthesis, and a performance by either an undergraduate or a recent graduate in piano performance, and ranked the three from best to worst

Times ranked   Inexpressive performance   Human (Ugrad + Grad = Total)   ESP1 synthesis
Best           2                          1 + 2 = 3                      9
Middle         5                          3 + 5 = 8                      1
Worst          7                          3 + 0 = 3                      4
was then used to generate an expressive synthetic performance of a different piece, the first
10 bars of the melody from Chopin’s Etude Op. 10 No. 3. Three other renditions of the
10-bar melody line were recorded as well: a literal synthetic rendition of the score (i.e. no
expressive variation), a performance by a local music graduate student, and a performance by
an advanced undergraduate majoring in piano performance.8 Each of the human performers
was given a warm-up run and then recorded three times with the best sounding recording (to
our ears) used in the experiment. The human performers played only the melody line of the
10-bar segment on a Yamaha Clavinova and were recorded in MIDI format with all pedalling data removed.
We arbitrarily selected fourteen music students from the halls of the music building to
participate in the listening test. Each subject listened to and ranked three performances: the
expressive synthetic, the inexpressive synthetic, and one of the two human performances
(without knowing which was which). The performances were presented in random order,
and the listener could re-hear performances as often as desired. The frequencies of each
performance/ranking pair are given in Table 7.
Although far from conclusive, the experiment did show some interesting trends. First,
although some listeners (4 of 14) preferred the inexpressive score to the expressive synthesis,
most (10 of 14) ranked the expressive synthesis higher. The expressive synthesis was also clearly preferred over the undergraduate's performance (5 of 7 times) and seemed at least comparable (preferred 4 of 7 times) to that of the graduate student, who presumably has a skill level similar to that of the students producing the training data. This gives
modest evidence that our system is successfully able to synthesize performances of novel
scores at a skill level approximating that of the performers generating the training data.
9. Conclusions
The problem of synthesizing expressive performance is both exciting and challenging. Musical performance is one of the many activities that trained people do very well without
knowing exactly how they do it. Other examples abound from the physical tasks of catching
a ball and driving a car to the more cognitive tasks of reading handwriting, recognizing spoken words, and understanding language and images. Systems that can produce reasonable
results for problems such as musical performance are extremely useful not only because of
8 To be honest, we also recorded a professional pianist playing the Etude, but we judged that her performance
was superior to our training data, and so omitted it from the listening tests. One cannot expect an automated
system to generate performances better than those it was trained with.
their utility in practical applications, but also because they can help to illuminate the tacit
knowledge employed by skilled performers.
The difficulty of producing high-quality data and the scarcity of public data sets make it
extremely difficult to compare and evaluate the different approaches to learning expressive
performances. We believe that the hierarchical sequence structure of music is a natural match
for hierarchical hidden Markov models and that our results provide good evidence for the
effectiveness of this approach. The use of entropic estimation helps produce sparser and
better models, and its use in other applications should be considered. We originally intended
to use multiple levels of phrasing in our models, but the simple two-level models reported
here performed at least as well in our preliminary experiments. We expect that more complex
models will be better suited to larger and more diverse data sets.
Our initial approach to performance synthesis addresses only a small part of this difficult
problem. Even with an instrument like the piano with relatively simple dynamics there are
important aspects that we ignore. Consideration of both melody and harmony presents a
number of new challenges because of the dramatic increase in performance variation possibilities. Although we have ideas on how to represent some of the additional dimensions (like
the “rolling” of chords), we do not yet have a good grasp of all the important aspects requiring additional features. Even in piano melodies we ignore potentially important performance
aspects like staccato and legato, grace notes, and the use of the pedals in order to focus on
the more fundamental tempo and dynamics issues. Although we have tackled only a portion
of the “real” problem, our techniques are easily generalized and can scale up as more and
richer data becomes available.
Given our restricted focus and limited data, we have achieved reasonable results when synthesizing performances of new scores. The listening test indicates that people rank ESP1's
performance at least comparably to that of pianists having (approximately) the same skill
level as the performers producing the data. More quantitative comparisons with human performances show that ESP learns and reproduces much of the tempo variation structure used
by trained pianists (see Fig. 5).
The generative models learned by ESP have uses beyond synthesis. For example, our
experiments showed that they were often surprisingly good at recognizing individual performers. The various sub-models provide a concise representation of performance strategy
that could be interpreted by a human expert to gain new musicological insight. These models
provide confirmation of earlier findings (Repp, 1997) regarding the degree of expressive
variation employed by professional and trained non-professional performers. In combination
with projection techniques such as MDS, the ESP modeling scheme provides an intuitive as
well as quantitative similarity measure for both performers and performances. Finally, models like those produced by ESP would be very useful for a number of musical information
retrieval tasks such as performance search, classification, and recommendation systems.
Acknowledgments We thank Bruno Repp for generously allowing us to work with his performance data as
well as Todd Marston and Tev Stevig for helping with score analysis. We also greatly appreciate the many
helpful comments and suggestions made by the special issue editor and the anonymous reviewers.
References
Arcos, J., & de Mántaras, R.L. (2001). An interactive CBR approach for generating expressive music. Journal
of Applied Intelligence, 27(1), 115–129.
Bengio, Y. (1999). Markovian models for sequential data. Neural Computing Surveys, 2, 129–162.
Bilmes, J. (1997). A gentle tutorial on the EM algorithm and its application to parameter estimation for
Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, University of California
at Berkeley.
Brand, M., & Hertzmann, A. (2000). Style machines. In: Proceedings of ACM SIGGRAPH 2000 (pp. 183–192).
Brand, M. (1999a). An entropic estimator for structure discovery. In M. J. Kearns, S. A. Solla, and D. A. Cohn
(Eds.), Advances in Neural Information Processing Systems 11. MIT Press: Cambridge, MA.
Brand, M. (1999b). Pattern discovery via entropy minimization. In: D. Heckerman and C. Whittaker (Eds.),
Artificial Intelligence and Statistics, Morgan Kaufmann.
Brand, M. (1999c). Structure learning in conditional probability models via an entropic prior and parameter
extinction. Neural Computation, 11(5), 1155–1182.
Brand, M. (1999d). Voice puppetry. In: A. Rockwood (Ed.), Proceedings of ACM SIGGRAPH 1999 (pp.
21–28), Los Angeles.
Bresin, R., Friberg, A., & Sundberg, J. (2002). Director musices: The KTH performance rules system. In:
Proceedings of SIGMUS-46, Kyoto.
Casey, M. (2003). Musical structure and content repurposing with Bayesian models. In: Proceedings of the
Cambridge Music Processing Colloquium.
Cemgil, A., Kappen, H., & Barber, D. (2003). Generative model based polyphonic music transcription. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, NY.
Cemgil, A., & Kappen, H. (2003). Monte Carlo methods for tempo tracking and rhythm quantization. Journal of Artificial Intelligence Research, 18, 45–81.
Corless, R., Gonnet, G., Hare, D., Jeffrey, D., & Knuth, D. (1996). On the Lambert W function. Advances in Computational Mathematics, 5, 329–359.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
de Mántaras, R.L., & Arcos, J. (2002). AI and music: From composition to expressive performances. AI
Magazine, 23(3), 43–57.
Dixon, S. (2001). Automatic extraction of tempo and beat from expressive performances. Journal of New Music Research, 30(1), 39–58.
Fine, S., Singer, Y., & Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and applications.
Machine Learning, 32(1), 41–62.
Grindlay, G. (2005). Modeling expressive musical performance with hidden Markov models. Master’s thesis,
Dept. of Computer Science, U.C. Santa Cruz.
Lang, D., & de Freitas, N. (2005). Beat tracking the graphical model way. In: L. K. Saul, Y. Weiss, and L.
Bottou, (Eds.), Advances in Neural Information Processing Systems 17 (pp. 745–752). Cambridge, MA:
MIT Press.
Lerdahl, F., & Jackendoff, R. (1983). A Generative Theory of Tonal Music. MIT Press.
Murphy, K.P., & Paskin, M.A. (2002). Linear-time inference in hierarchical HMMs. In: T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14 (pp. 833–840). Cambridge, MA: MIT Press.
Murphy, K. (2004). The BayesNet toolbox. URL: http://bnt.sourceforge.net.
Raphael, C. (2002a). Automatic transcription of piano music. In: Proceedings of ISMIR, Paris, France.
Raphael, C. (2002b). A Bayesian Network for real-time musical accompaniment. In T.G. Dietterich, S. Becker,
and Z. Ghahramani, (Eds.), Advances in Neural Information Processing Systems 14, (pp. 1433–1439).
Cambridge, MA: MIT Press.
Repp, B. (1992). Diversity and commonality in music performance: An analysis of timing microstructure in Schumann's Träumerei. Journal of the Acoustical Society of America, 92, 2546–2568.
Repp, B. (1997). Expressive timing in a Debussy prelude: A comparison of student and expert pianists. Musicae
Scientiae, 1(2), 257–268.
Saunders, C., Hardoon, D. R., Shawe-Taylor, J., & Widmer, G. (2004). Using string kernels to identify
famous performers from their playing style. In: Proceedings of the 15th European Conference on Machine
Learning (ECML’2004) (pp. 384–395), Springer.
Scheirer, E. (1995). Extracting expressive performance information from recorded music. Master’s thesis,
Program in Media Arts and Science, Massachusetts Institute of Technology.
Stamatatos, E., & Widmer, G. (2005). Automatic identification of music performers with learning ensembles.
Artificial Intelligence, 165(1), 37–56.
Tobudic, A., & Widmer, G. (2003). Learning to play Mozart: Recent improvements. In: Proceedings of the
IJCAI’03 Workshop on Methods for Automatic Music Performance and their Applications in a Public
Rendering Contest.
Tobudic, A., & Widmer, G. (2005). Learning to play like the great pianists. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI'05).
Todd, N. (1992). The dynamics of dynamics: A model of musical expression. Journal of the Acoustical Society
of America, 91, 3540–3550.
Wang, T., Zheng, N., Li, Y., Xu, Y., & Shum, H. (2003). Learning kernel-based HMMs for dynamic sequence
synthesis. Graphical Models, 65(4), 206–221.
Widmer, G., & Goebl, W. (2004). Computational models of expressive music performance: The state of the
art. Journal of New Music Research, 33(3), 203–216.
Widmer, G., & Tobudic, A. (2003). Playing Mozart by analogy: Learning multi-level timing and dynamics
strategies. Journal of New Music Research, 32(3), 259–268.
Widmer, G. (2003). Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical discoveries. Artificial Intelligence, 146(2), 129–148.