A Lightweight Deep Sequential Model for
Continuous Chord and Melody Generation
Xinyu Li, NYU Shanghai, Chengdu, China, xl3133@nyu.edu
Li Yi, NYU Shanghai, Beijing, China, ly1387@nyu.edu
Harry Lee, NYU Shanghai, Leshan, China, hl3794@nyu.edu
Abstract—With the rapid progress of deep learning in recent years, generative models have given machines unprecedented artistic creativity at the level of grids (e.g. images) and simple sequences (e.g. words and sentences). However, music generation remains a challenging field for machines. In this paper, we focus on music composition, specifically chord and melody generation, which is a fundamental part of songwriting. By analyzing symbolic representations of melody and chords as time series, deep sequential models should be able to grasp the inner connections between melody notes and chords. We propose a lightweight deep sequential model that continues a composition based on some given chords and melody. Using masking techniques, we regularize the generating process, so that we can control the rhythm and decide where the melody notes should change or rest.
I. INTRODUCTION
Compared to generative problems in computer vision (e.g. image style transfer) and natural language processing (e.g. dialogue systems), music generation is a field where machine learning has great untapped potential. After taking MUS-SHU 221 Songwriting at NYU Shanghai in Spring 2020 together, we realized that the melody and chords of pop songs usually contain regular patterns, which should be learnable by deep learning models. We decided to design a deep sequential model for this melody and chord generation task as the final project for CSCI-SHU 360 Machine Learning. In this paper, we discuss this problem in detail, first specifying our task as follows. Given a timestamp T and the melody and chords over the previous timestamps {1, 2, ..., T − 1}, we expect the model to predict the chord and melody note at time T. Then, we shift T to T + 1 and continue this composition process step by step.
II. DATA PREPARATION
A. Data Collection
The data collection process for our project was quite challenging. To train our model, we needed chord-labeled melodies. Such information is easy to fetch from MIDI files, a file format designed to store music scores, but not many chord-labeled MIDI collections are published on the web. One candidate dataset is POP909 [1], which was created by NYU Shanghai's Music-X-Lab and published at ISMIR 2020. The dataset contains 909 records of Chinese pop songs; each record contains melody and chord information and is already segmented into phrases. However, after we fed POP909 to our model, the performance was not as good as we expected. One possible explanation is that the chord progressions in POP909 are very diverse and sometimes complicated, and our model is not expressive enough to perform well on this dataset.
Concluding that the dataset used in our project must be neat in structure, and finding no such mature dataset published on the Internet, we decided to create a suitable dataset ourselves. In the preparation stage, we listed the songs we were going to record. These songs cover a variety of styles of homophonic music, including GuFeng music and electronic music. We then extracted the chorus of each song, with each extracted phrase containing 8 measures (128 semiquavers), and reclassified the songs according to the chord progressions of the extracted phrases; 14 chord progression classes were created in total. In the formal sampling stage, we recorded the MIDIs of these phrases in Logic Pro and did a preliminary time quantization using one of Logic Pro's built-in functions.

Eventually, we obtained 52 MIDI records, a number that later grew to 83. We saved the dataset with 52 records and the one with 83 records separately and fed each to our model. The performance on the dataset with 83 records was not as good as on the one with 52 records, probably because its chord progression classes are more miscellaneous. In the remainder of this work, we use the dataset with 52 records.
B. Data Pre-processing
In MIDI files, notes are stored as tuples, each containing pitch, onset, offset, and velocity information. The pitch is stored as a MIDI pitch, an integer between 0 and 127: Middle C (C4) is assigned MIDI pitch 60, and the interval between consecutive MIDI pitches is a semitone. In our dataset, only 14 MIDI pitches are used: 65, 67, 69, 71, 72, 74, 76, 77, 79, 81, 83, 84, 86, 88. The onset and offset of a note are stored as timestamps, and the velocity is also an integer between 0 and 127.

For data processing we used the third-party Python package pretty_midi [2], published at ISMIR 2014 by Colin Raffel et al. The package allows us to load MIDI files and fetch the note information. According to the tempo stored in the MIDI file, we first calculated the duration of a sixteenth note and quantized each note to its nearest sixteenth-note cell. Next, assuming that at most one note is playing at each timestamp, we sampled the note playing at each sixteenth-note cell and stored its MIDI pitch in a sequence, storing the integer 0 for rests. Eventually, we obtained a sequence of length 128 whose elements are MIDI pitches. With the chord progressions of the melodies already stored, we also created a chord sequence of the same length, indicating the chord at each sixteenth-note cell. Each element of the chord sequence is an integer between 1 and 6, referring to the six common chords I, ii, iii, IV, V, vi. In total, 52 records were created, each containing a melody sequence and a chord sequence, both of length 128. Counting the rest class 0, the elements of the melody sequence fall into 15 classes in total, while the elements of the chord sequence fall into 6, as shown in Fig. 1.
Fig. 1. MIDI pitch and chord distribution
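To make the pre-processing concrete, the following is a minimal sketch of how a melody track could be quantized into a length-128 sequence with pretty_midi. It assumes a single monophonic melody track and a constant tempo; the function name is our own illustrative choice, not part of the original code.

```python
import numpy as np
import pretty_midi

def midi_to_melody_sequence(path, n_steps=128):
    """Quantize the melody track of a MIDI file into sixteenth-note cells.
    A sketch: assumes a single monophonic melody track and a constant tempo."""
    pm = pretty_midi.PrettyMIDI(path)
    # Duration of one sixteenth note under the first tempo marking.
    _, tempi = pm.get_tempo_changes()
    sixteenth = 60.0 / tempi[0] / 4.0

    melody = np.zeros(n_steps, dtype=np.int64)   # 0 denotes a rest
    for note in pm.instruments[0].notes:
        start = int(round(note.start / sixteenth))
        end = max(start + 1, int(round(note.end / sixteenth)))
        melody[start:min(end, n_steps)] = note.pitch
    return melody
```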
III. METHODOLOGY
A. Task Definition
There are multiple ways to formulate the problem of music generation, since the structure and amount of data can vary a lot, and so can what we want the machine to write. We specify our task as follows. Given melody and chords of length T = 128, we would like to train a model to compose the following melody and chords of length T, forming a music piece of length 2T. Specifically, one timestamp corresponds to a semiquaver. We predict the chords and melody notes at the timestamps {T + 1, T + 2, ..., 2T} step by step: we first set i = 1 and predict the note and chord at timestamp T + i, then increase i by 1 and predict again. For each t in {T + 1, ..., 2T}, our prediction is conditioned only on the melody notes and chords at the timestamps {t − T, t − T + 1, ..., t − 1}. That is to say, we assume that the melody note and chord at timestamp t are independent of the notes and chords before timestamp t − T. We create a sliding window of length T whose start is t − T and whose end is t − 1, and predict conditionally only on the information inside the window. For example, with T = 128, the prediction at timestamp 129 conditions on timestamps 1 to 128, and the prediction at timestamp 130 conditions on timestamps 2 to 129. Each time, we predict a melody note and a chord, then increase t by 1 and push the sliding window one step forward as well. Doing this recursively for T steps yields a music piece of length 2T.
B. Model Structures
We have a chord model and a melody model for the task,
which would be explained separately in the following paragraphs and illustrated in Fig. 2. and Fig. 3. Here we denote the
length of the sequence (of melody notes and chords) we would
feed in the model by seq len. Note that the hidden dimensions
for the chord embeddings, the melody embeddings and the
output embeddings are chosen as 4, 8 and 16, which are
hyperparameters that may be further optimized. The number
of layers of LSTM we stack is also a hyperparameter that can
be tuned.
1) The Chord Model: As mentioned above, the prediction of a chord is conditioned on the previous chords and the previous melody. The chord model first transforms the tensor of previous chords of shape (seq_len, 6) and the tensor of previous melody notes of shape (seq_len, 15) into two embeddings of shape (seq_len, 4) and (seq_len, 8) using linear transformations with tanh activations. Then, we concatenate the chord embeddings and melody embeddings along dimension 1 into input embeddings of shape (seq_len, 12), which are fed into a 3-layer LSTM network [3]. We take the last embedding of the LSTM output and apply a linear transformation to derive the probability distribution of the chord being predicted. In the training process, we minimize, at every timestamp, the cross-entropy loss between this distribution and the ground-truth one-hot vector of the current chord. In the generating process, we apply an argmax operation to the distribution to obtain the chord predicted for the current timestamp, and iterate this process for the following timestamps.
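The following PyTorch sketch illustrates the structure just described (linear embeddings with tanh, a 3-layer LSTM, and a linear output head). It is our own reconstruction from the text rather than the original implementation; class and parameter names are ours.

```python
import torch
import torch.nn as nn

class ChordModel(nn.Module):
    """Sketch of the chord model described above; dimensions follow the text,
    but the class and parameter names are ours."""

    def __init__(self, n_chords=6, n_pitches=15,
                 chord_dim=4, melody_dim=8, hidden_dim=16, num_layers=3):
        super().__init__()
        self.chord_embed = nn.Linear(n_chords, chord_dim)
        self.melody_embed = nn.Linear(n_pitches, melody_dim)
        self.lstm = nn.LSTM(chord_dim + melody_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_chords)

    def forward(self, chords, melody):
        # chords: (batch, seq_len, 6) one-hot; melody: (batch, seq_len, 15) one-hot
        c = torch.tanh(self.chord_embed(chords))     # (batch, seq_len, 4)
        m = torch.tanh(self.melody_embed(melody))    # (batch, seq_len, 8)
        h, _ = self.lstm(torch.cat([c, m], dim=-1))  # LSTM input of shape (batch, seq_len, 12)
        # Take the last output embedding and map it to chord logits.
        return self.out(h[:, -1, :])                 # (batch, 6)
```

Training would then minimize a cross-entropy loss between these logits and the ground-truth chord index at each timestamp.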
2) The Melody Model: The general idea of the melody model is the same as that of the chord model. However, the prediction of a melody note is conditioned on one extra element compared with the prediction of a chord, namely the chord of the current timestamp. Hence, the input sequence of chords has length seq_len + 1, as shown in Fig. 3. However, we want the LSTM input to have length seq_len for convenient concatenation. Thus, we apply an element-wise product between the tensor of the current chord and the tensor of previous chords, whose outcome has length seq_len, to integrate their information. This operation is inspired by dot-product attention [4]. The remaining parts of the melody model are almost the same as in the chord model.
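A sketch of the melody model, mirroring the chord model sketch above, is shown below. Here the length-(seq_len + 1) chord input is represented as the length-seq_len history plus the current chord passed separately. The text does not state whether the element-wise product acts on the one-hot chord tensors or on their embeddings; this sketch assumes the embeddings.

```python
import torch
import torch.nn as nn

class MelodyModel(nn.Module):
    """Sketch of the melody model; mirrors ChordModel but also conditions on
    the current chord via an element-wise product (our reading of the text)."""

    def __init__(self, n_chords=6, n_pitches=15,
                 chord_dim=4, melody_dim=8, hidden_dim=16, num_layers=3):
        super().__init__()
        self.chord_embed = nn.Linear(n_chords, chord_dim)
        self.melody_embed = nn.Linear(n_pitches, melody_dim)
        self.lstm = nn.LSTM(chord_dim + melody_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_pitches)

    def forward(self, prev_chords, cur_chord, prev_melody):
        # prev_chords: (batch, seq_len, 6); cur_chord: (batch, 6); prev_melody: (batch, seq_len, 15)
        c_prev = torch.tanh(self.chord_embed(prev_chords))            # (batch, seq_len, 4)
        c_cur = torch.tanh(self.chord_embed(cur_chord)).unsqueeze(1)  # (batch, 1, 4)
        c = c_prev * c_cur        # element-wise product, broadcast over seq_len
        m = torch.tanh(self.melody_embed(prev_melody))                # (batch, seq_len, 8)
        h, _ = self.lstm(torch.cat([c, m], dim=-1))
        return self.out(h[:, -1, :])   # logits over the 15 melody classes
```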
C. Approaches
1) Basic Method: As stated in the task definition, we need to predict T timestamps step by step. For timestamp t in {T + 1, ..., 2T}, our conditions are two tensors M_{t−T, t−1} (representing the previous T melody notes) and C_{t−T, t−1} (representing the previous T chords). First, we feed M_{t−T, t−1} and C_{t−T, t−1} into the chord model, which predicts the chord of the current timestamp, denoted by C_t. Then, we feed M_{t−T, t−1}, C_{t−T, t−1}, and C_t into the melody model, whose prediction is the current melody note, denoted by M_t. Afterwards, we append M_t and C_t to our current melody line and chord line, increase t by 1, push the sliding window one step forward, and make the next prediction. This recursive process is repeated T times to obtain a music piece of length 2T.
2) Regularization: The training process can be carried out smoothly with the basic method: we simply minimize the cross-entropy loss between the predicted distribution and the ground-truth one-hot vector at every timestamp. However, when using the basic method directly, we observe that the generating process is very unstable. One important reason is that once the melody/chord model has predicted a bad note/chord, this bad information is appended to our data and has a large negative impact on the following predictions. Therefore, we use some techniques to regularize the generation.
For chords, we adopt the model's prediction only every 8 timestamps; that is, we use the prediction at timestamps {T + 1, T + 9, T + 17, ...}. At other timestamps, we directly inherit the chord of the previous timestamp. We do this because chords should not vary much at the level of semiquavers; in fact, chords usually change every 8 or 16 semiquavers in pop songs.
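Within the generation loop sketched earlier, this rule could look as follows (an illustrative fragment wrapped as a helper; variable names follow that sketch).

```python
import torch

def predict_or_inherit_chord(chord_model, c_win, m_win, chd_seq, t, T=128):
    """Chord regularization rule (a sketch): adopt the model's prediction only at
    timestamps T+1, T+9, ..., i.e. every 8 semiquavers; otherwise inherit the
    previous chord."""
    if (t - T) % 8 == 0:
        return int(chord_model(c_win, m_win).argmax(-1))
    return chd_seq[t - 1]  # inherit the previous chord
```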
For melody generation, we aim to generate a new piece of melody that has the same rhythm as the originally given melody. Thus, we introduce a masking trick. We have two masks, namely the note mask and the pitch shift mask. Both masks are generated from the rhythm of the given melody of length T and are illustrated in Fig. 4 below. The note mask marks the timestamps at which the note is or is not a rest, while the pitch shift mask marks the timestamps at which the pitch changes or does not. For each timestamp, if the note mask is 0, we simply put a rest there and skip the prediction. If the note mask is 1 while the pitch shift mask is 0, we directly inherit the previous note and skip the prediction. Otherwise, a new, non-rest note should appear there, so we run the melody prediction and sample one note from the predicted distribution that is both non-rest and different from the previous melody note.
Fig. 4. An example of (a sub-sequence of) our melody line and the two masks
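A sketch of how the two masks could be derived from the given melody sequence, under our reading of the rules above (0 denotes a rest):

```python
import numpy as np

def build_masks(melody):
    """Derive the note mask and the pitch shift mask from a length-T melody
    sequence (0 = rest); a sketch under our reading of the masking rules."""
    melody = np.asarray(melody)
    note_mask = (melody != 0).astype(int)       # 1 where a note is sounding
    prev = np.concatenate(([0], melody[:-1]))
    # 1 where a new, non-rest note starts (pitch differs from the previous cell).
    pitch_shift_mask = ((melody != prev) & (melody != 0)).astype(int)
    return note_mask, pitch_shift_mask
```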
Another point worth mentioning is that, after each melody prediction, the sampling method we use is top-k sampling instead of greedy sampling. The purpose is to add some variation to the melody line; otherwise we might end up with a sequence consisting only of the root note of each chord.
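A minimal top-k sampling helper consistent with this description might look as follows; the value of k and the helper name are our own illustrative choices.

```python
import torch

def top_k_sample(logits, k=3, forbidden=()):
    """Sample a melody class from the top-k logits, excluding forbidden classes
    (e.g. the rest class and the previous pitch)."""
    logits = logits.clone()
    for idx in forbidden:
        logits[idx] = float("-inf")    # rule out rests / repeated pitches
    top_vals, top_idx = torch.topk(logits, k)
    probs = torch.softmax(top_vals, dim=-1)
    return int(top_idx[torch.multinomial(probs, 1)])
```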
Fig. 5. Example of given melody and generated melody
IV. RESULTS
The performance of the chord model is not ideal, while the melody model performs relatively satisfactorily.
Evaluating the task and our approaches as a whole, we realize that one reason the chord model does not work well lies in the task itself. In songwriting, the chord progression is commonly fixed before the melody is written. However, at every timestamp our model predicts the current chord based on both the previous melody and the previous chords, which introduces much instability into the generating process. What is worse, once the model has made a bad prediction, that prediction has a large negative influence on the following predictions.
The melody model turns out to perform relatively well. Since we use top-k sampling, some randomness is introduced in the generating process, which is helpful because a piece of melody needs a fair amount of variation. In most cases, we obtain a new piece of melody of length T (following the given melody of length T) that sounds smooth and fits the chord progression. However, in many cases, the generated melody is relatively messy compared with the melody lines in our data. The generated melody still lacks the sense of "phrases": although the melody notes indeed fit the chords, they do not make much sense as a whole. In songwriting, besides variation, another important feature is repetition. One ideal case would be a melody consisting of a few phrases that vary yet still sound similar and related. There were a few times in our experiments when the model actually generated a piece of melody with this sense of "phrases" to some extent. We picked one of these melody lines, did some basic arrangement in FL Studio 20, and derived a music piece to be shown in the final presentation. Also, part of the melody lines are shown in Fig. 5; note that the first line is the given melody and the second line is the generated melody. The given melody comes from the song "Scared to Be Lonely" by Martin Garrix & Dua Lipa.
V. FUTURE WORK
There are two novel ideas suggested by Professor Gus Xia at NYU Shanghai, which we sincerely appreciate. First, when predicting the melody notes, we could introduce a hierarchical learning mechanism: first predict whether the current note is a chord tone, then predict its pitch based on this feature. Second, for chord prediction, we could introduce an attention mechanism between the given chord embeddings and the predicted chord embedding, to make the piece of music sound better as a whole.
ACKNOWLEDGMENT
This project was tutored by Professor Guo Li at NYU Shanghai. We also thank Professor Gus Xia for his advice on the feasibility of our proposal, without which we might have been unable to take the first step of this project. We further appreciate his idea of introducing an attention mechanism between the melody and chords, which is of great research significance.
REFERENCES
[1] Z. Wang, K. Chen, J. Jiang, Y. Zhang, M. Xu, S. Dai, X. Gu, and G. Xia, "POP909: A pop-song dataset for music arrangement generation," in Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), 2020.
[2] C. Raffel and D. P. W. Ellis, "Intuitive analysis, creation and manipulation of MIDI data with pretty_midi," in Proceedings of the 15th International Society for Music Information Retrieval Conference Late Breaking and Demo Papers (ISMIR), 2014.
[3] C. Olah, "Understanding LSTM Networks," Aug. 2015. Available at https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," CoRR, vol. abs/1706.03762, 2017.