A Lightweight Deep Sequential Model for Continuous Chord and Melody Generation

Xinyu Li, NYU Shanghai, Chengdu, China, xl3133@nyu.edu
Li Yi, NYU Shanghai, Beijing, China, ly1387@nyu.edu
Harry Lee, NYU Shanghai, Leshan, China, hl3794@nyu.edu

Abstract—With the fast progress of deep learning in recent years, generative models have given machines unprecedented artistic creativity at the level of grids (e.g., images) and simple sequences (e.g., words and sentences). However, music generation remains a challenging field for machines. In this paper, we focus on music composition, specifically chord and melody generation, which is a fundamental part of songwriting. By analyzing symbolic representations of melody and chords as time series, deep sequential models should be able to grasp the inner connections between melody notes and chords. We propose a lightweight deep sequential model that continues a composition based on some given chords and melody. By using masking techniques, we regularize the generating process, controlling the rhythm and deciding where the melody notes should change or rest.

I. INTRODUCTION

Compared to generative problems in computer vision (e.g., image style transfer) and natural language processing (e.g., dialogue systems), music generation is a field where machine learning has great untapped potential. After taking MUS-SHU 221 Songwriting at NYU Shanghai in Spring 2020 together, we realized that the melody and chords of pop songs usually contain regular patterns, which should be learnable by deep learning models. We decided to design a deep sequential model for this melody and chord generation task as the final project for CSCI-SHU 360 Machine Learning.

In this paper, we discuss this problem in detail by first specifying our task as follows. Given a timestamp T and the melody and chords at the previous timestamps {1, 2, ..., T − 1}, we expect the model to predict the chord and the melody note at time T. Then we shift T to T + 1 and continue this composition process step by step.

II. DATA PREPARATION

A. Data Collection

The data collection process of our project was quite tough. To train our model, we have to find chord-labeled melodies. Such information is easy to fetch from MIDI files, a file format designed to store musical scores, but not many chord-labeled MIDI files are published on the web. One alternative dataset is POP909 [1], which was created by NYU Shanghai's Music-X-Lab and published at ISMIR 2020. The dataset contains 909 records of Chinese pop songs. Each record contains melody and chord information and is already segmented into phrases. However, after we fed POP909 to our model, the performance was not as good as we expected. One possible explanation is that the chord progressions in POP909 are very diverse and some are complicated, while our model is not expressive enough to perform well on such a dataset.

Concluding that the dataset used in our project must be neat in structure, and realizing that no such mature dataset is published on the Internet, we decided to create a suitable dataset ourselves. In the preparation stage, we listed the songs we were going to record. These songs cover a variety of styles of homophonic music, including GuFeng music and electronic music. We then extracted the chorus of each song, with each extracted phrase containing 8 measures (128 semiquavers), and reclassified the songs according to the chord progressions of the extracted phrases.
In total, 14 chord progression classes were created. In the formal sampling stage, we recorded MIDIs of these phrases using Logic Pro and performed a preliminary time quantization by calling a built-in function of Logic Pro. Eventually, we obtained 52 MIDI records, a number later extended to 83. We saved the 52-record and 83-record versions of the dataset separately and fed each to our model. The model performed worse on the 83-record dataset than on the 52-record one, probably because the chord progression classes in the larger set are more miscellaneous. In the following research, we therefore worked with the 52-record dataset.

B. Data Pre-processing

In MIDI files, notes are stored as tuples, each containing the pitch, onset, offset, and velocity. The pitch is stored as a MIDI pitch, an integer between 0 and 127: Middle C (C4) is assigned MIDI pitch 60, and notes with consecutive MIDI pitches are a semitone apart. In our dataset, only 14 MIDI pitches are used: 65, 67, 69, 71, 72, 74, 76, 77, 79, 81, 83, 84, 86, 88. The onset and offset of a note are stored as timestamps, and the velocity is also an integer between 0 and 127.

In the data processing part, we utilized the third-party Python package pretty_midi [2], published at ISMIR 2014 by Colin Raffel et al. The package allows us to load MIDI files and fetch the note information. According to the tempo stored in the MIDI file, we first calculated the duration of a sixteenth note and quantized each note to its nearest sixteenth-note cell. Next, assuming that at most one note can be playing at each timestamp, we sampled the note playing at each sixteenth-note cell and stored its MIDI pitch in a sequence, storing the integer 0 for rests. Eventually, we obtained a sequence of length 128 whose elements are MIDI pitches (or 0 for rests). Since the chord progressions of the melodies were already recorded, we also created a chord sequence of the same length, indicating the chord at each sixteenth-note cell. Each element of the chord sequence is an integer between 1 and 6, referring to the six common chords I, ii, iii, IV, V, vi. In total, 52 records were created, each containing a melody sequence and a chord sequence, both of length 128. Counting the 0 (rest) class, elements of the melody sequence fall into 15 classes in total, while elements of the chord sequence fall into 6, as shown in Fig. 1.

Fig. 1. MIDI pitch and chord distribution
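As a concrete illustration of the quantization step described above, the following is a minimal sketch assuming the pretty_midi package and a monophonic melody stored in the first track of each file; the function name and constants are ours, and the chord sequence is built from our own chord annotations rather than extracted from the MIDI file.

    import numpy as np
    import pretty_midi

    SEQ_LEN = 128  # 8 measures of sixteenth-note cells

    def melody_to_sequence(midi_path, seq_len=SEQ_LEN):
        # Quantize a monophonic melody track to a sequence of MIDI pitches,
        # one entry per sixteenth-note cell, with 0 marking a rest.
        pm = pretty_midi.PrettyMIDI(midi_path)
        _, tempi = pm.get_tempo_changes()      # assume a single, constant tempo
        sixteenth = 60.0 / tempi[0] / 4.0      # duration of one sixteenth note in seconds
        seq = np.zeros(seq_len, dtype=np.int64)
        for note in pm.instruments[0].notes:   # assume track 0 holds the melody
            start = int(round(note.start / sixteenth))
            end = max(start + 1, int(round(note.end / sixteenth)))
            seq[start:min(end, seq_len)] = note.pitch  # at most one note per cell
        return seq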
III. METHODOLOGY

A. Task Definition

There are multiple ways to pose the problem of music generation, since the structure and amount of data can vary a lot and what we want the machine to write can differ as well. We specify our task as follows. Given some melody and chords of length T = 128, we would like to train a model to compose the following melody and chords of length T, forming a music piece of length 2T. Specifically, we let one timestamp correspond to a semiquaver. We predict the chords and melody notes at the timestamps {T + 1, T + 2, ..., 2T} step by step: we first set i = 1 and make the predictions, and each time we increase i by 1 and predict again. For each t in {T + 1, ..., 2T}, our prediction conditions only on the melody notes and chords at the timestamps {t − T, t − T + 1, ..., t − 1}. That is to say, we assume that the melody note and chord at timestamp t are independent of the notes and chords before timestamp t − T.

We create a sliding window of length T, and for each i we set the start of the sliding window to t − T and its end to t − 1. We then predict conditioning only on the information inside the sliding window. Each time, we predict a melody note and a chord, increase i by 1, and push the sliding window one step forward as well. By doing this recursively for T steps, we derive a music piece of length 2T.

B. Model Structures

We have a chord model and a melody model for the task, which are explained separately in the following paragraphs and illustrated in Fig. 2 and Fig. 3. Here we denote by seq_len the length of the sequence (of melody notes and chords) we feed into the model. Note that the hidden dimensions of the chord embeddings, the melody embeddings, and the output embeddings are chosen as 4, 8, and 16 respectively; these are hyperparameters that may be further optimized. The number of stacked LSTM layers is also a hyperparameter that can be tuned.

Fig. 2. The chord model
Fig. 3. The melody model

1) The Chord Model: As mentioned in the previous paragraphs, the prediction of a chord is conditioned on the previous chords and the previous melody. The chord model first transforms the tensor of previous chords, of shape (seq_len, 6), and the tensor of previous melody notes, of shape (seq_len, 15), into two embeddings of shape (seq_len, 4) and (seq_len, 8) via a linear transformation followed by a tanh activation. Then we concatenate the chord embeddings and melody embeddings along dimension 1 into input embeddings of shape (seq_len, 12), which are fed into a 3-layer LSTM network [3]. We take the last embedding of the LSTM output and apply a linear transformation to derive the probability distribution over the chord we are predicting. In the training process, we minimize the cross-entropy loss between this distribution and the ground-truth one-hot vector of the current chord, at every timestamp. In the generating process, we apply an argmax to the distribution to obtain the predicted chord for the current timestamp, and iterate this process for the following timestamps.

2) The Melody Model: The general idea of the melody model is the same as that of the chord model. However, the prediction of a melody note is conditioned on one extra element compared with the prediction of a chord, namely the chord of the current timestamp. Hence, the input sequence of chords has length seq_len + 1, as shown in Fig. 3. However, we want the input of the LSTM network to have length seq_len for convenience of concatenation. Thus, we apply an element-wise product between the tensor of the current chord and the tensor of previous chords, whose result has length seq_len, to integrate their information. This operation is inspired by dot-product attention [4]. The remaining parts of the melody model are almost the same as in the chord model.
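To make the two architectures concrete, the following is a minimal sketch in PyTorch (an assumption on our part; any deep learning framework would do). It follows the shapes described above with a batch dimension added for convenience; the class and variable names are ours.

    import torch
    import torch.nn as nn

    class ChordModel(nn.Module):
        # Previous chords and melody notes in; a distribution over the 6 chord classes out.
        def __init__(self, n_chords=6, n_pitches=15,
                     chord_dim=4, melody_dim=8, hidden_dim=16, num_layers=3):
            super().__init__()
            self.chord_embed = nn.Sequential(nn.Linear(n_chords, chord_dim), nn.Tanh())
            self.melody_embed = nn.Sequential(nn.Linear(n_pitches, melody_dim), nn.Tanh())
            self.lstm = nn.LSTM(chord_dim + melody_dim, hidden_dim,
                                num_layers=num_layers, batch_first=True)
            self.out = nn.Linear(hidden_dim, n_chords)

        def forward(self, prev_chords, prev_melody):
            # prev_chords: (batch, seq_len, 6) one-hot; prev_melody: (batch, seq_len, 15) one-hot
            x = torch.cat([self.chord_embed(prev_chords),
                           self.melody_embed(prev_melody)], dim=-1)  # (batch, seq_len, 12)
            h, _ = self.lstm(x)
            return self.out(h[:, -1])  # logits for the next chord

    class MelodyModel(nn.Module):
        # Same backbone, but additionally conditioned on the current chord, which is
        # fused with the previous chords by an element-wise (broadcast) product.
        def __init__(self, n_chords=6, n_pitches=15,
                     chord_dim=4, melody_dim=8, hidden_dim=16, num_layers=3):
            super().__init__()
            self.chord_embed = nn.Sequential(nn.Linear(n_chords, chord_dim), nn.Tanh())
            self.melody_embed = nn.Sequential(nn.Linear(n_pitches, melody_dim), nn.Tanh())
            self.lstm = nn.LSTM(chord_dim + melody_dim, hidden_dim,
                                num_layers=num_layers, batch_first=True)
            self.out = nn.Linear(hidden_dim, n_pitches)

        def forward(self, prev_chords, prev_melody, cur_chord):
            # cur_chord: (batch, 6) one-hot; the broadcast product keeps the length seq_len
            fused = prev_chords * cur_chord.unsqueeze(1)  # (batch, seq_len, 6)
            x = torch.cat([self.chord_embed(fused),
                           self.melody_embed(prev_melody)], dim=-1)
            h, _ = self.lstm(x)
            return self.out(h[:, -1])  # logits for the next melody note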
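Training then amounts to teacher forcing over sliding windows cut from the 52 records. Below is a hedged sketch of one training step for the chord model under the same assumptions (the melody model is trained analogously, with the current chord as an extra input); the data layout and the optimizer choice are ours, not prescribed by the method.

    import torch
    import torch.nn.functional as F

    # optimizer = torch.optim.Adam(chord_model.parameters(), lr=1e-3)  # one possible choice

    def chord_training_step(chord_model, optimizer, melody_windows, chord_windows, targets):
        # melody_windows: (batch, seq_len) pitch-class indices (0 = rest, 1..14 = the 14 pitches)
        # chord_windows:  (batch, seq_len) chord-class indices re-indexed to 0..5
        # targets:        (batch,) index of the chord at the next timestamp
        optimizer.zero_grad()
        m = F.one_hot(melody_windows, 15).float()
        c = F.one_hot(chord_windows, 6).float()
        logits = chord_model(c, m)                # (batch, 6)
        loss = F.cross_entropy(logits, targets)   # cross-entropy against the true next chord
        loss.backward()
        optimizer.step()
        return loss.item()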
C. Approaches

1) Basic Method: As stated in the task definition, we need to predict T times, step by step. For a timestamp t in {T + 1, ..., 2T}, our conditions are two tensors M_{t−T,t−1} (representing the previous T melody notes) and C_{t−T,t−1} (representing the previous T chords). First, we feed M_{t−T,t−1} and C_{t−T,t−1} into the chord model, which predicts the chord of the current timestamp, denoted by C_t. Then we feed M_{t−T,t−1}, C_{t−T,t−1}, and C_t into the melody model, whose prediction is the current melody note, denoted by M_t. Afterwards, we append M_t and C_t to our current melody line and chord line, increase t by 1, push the sliding window one step forward, and make the next prediction. This recursive process is repeated T times to obtain a music piece of length 2T.

2) Regularization: The training process can be carried out smoothly with the basic method: we simply minimize the cross-entropy loss between the predicted distribution and the ground-truth one-hot vector at every timestamp. However, when the basic method is used directly for generation, the process is very unstable. One important reason is that once the melody or chord model has predicted a bad note or chord, this bad information is appended to our data and has a huge negative impact on the following predictions. Therefore, we use several techniques to regularize the generation.

For chords, we adopt the model's prediction only every 8 timestamps; that is, we use the prediction of the model at timestamps {T + 1, T + 9, T + 17, ...}. At the other timestamps, we directly inherit the chord of the previous timestamp. We do this because chords should not vary much at the level of semiquavers; in fact, chords usually change every 8 or 16 semiquavers in pop songs.

For melody generation, we aim to generate a new piece of melody that has the same rhythm as the originally given melody. Thus, we introduce a masking trick with two masks, namely the note mask and the pitch shift mask. Both masks are generated from the rhythm of the given melody of length T and are illustrated in Fig. 4 below. The note mask marks the timestamps where the note is or is not a rest, while the pitch shift mask marks the timestamps where the pitch changes or does not. For each timestamp, if the note mask is 0, we simply put a rest there and skip the prediction. If the note mask is 1 while the pitch shift mask is 0, we directly inherit the previous note and skip the prediction. Otherwise, a new, non-rest note should appear there, so we run the melody prediction and sample from the predicted distribution a note that is both non-rest and different from the previous melody note. Another thing worth mentioning is that after each melody prediction we sample with top-k sampling instead of greedy sampling. The purpose is to add some variation to the melody line; otherwise, we might end up with a sequence consisting only of the root note of each chord.

Fig. 4. An example of (a sub-sequence of) our melody line and the two masks
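Putting everything together, the sketch below (again a PyTorch-based assumption, reusing the ChordModel and MelodyModel classes sketched earlier) shows how the continuation could be generated with these regularizations: chords are re-predicted only every 8 semiquavers, the rhythm of the given melody is imposed through the two masks, and new notes are drawn with top-k sampling. The helper names and the value of k are our own choices.

    import torch
    import torch.nn.functional as F

    def one_hot(seq, n):
        # 1-D tensor of class indices -> (len, n) float one-hot matrix
        return F.one_hot(seq, n).float()

    def build_masks(given_melody):
        # given_melody: 1-D LongTensor of pitch classes (0 = rest)
        note_mask = (given_melody != 0).long()                    # 1 where a note sounds
        prev = torch.cat([torch.zeros(1, dtype=given_melody.dtype), given_melody[:-1]])
        pitch_shift_mask = (given_melody != prev).long()          # 1 where the pitch changes
        return note_mask, pitch_shift_mask

    @torch.no_grad()
    def generate(chord_model, melody_model, melody, chords, T=128, k=3):
        # melody, chords: 1-D LongTensors of length T (chord classes re-indexed to 0..5)
        melody, chords = melody.clone(), chords.clone()
        note_mask, shift_mask = build_masks(melody)
        for i in range(T):
            m_win = one_hot(melody[-T:], 15).unsqueeze(0)         # sliding window, (1, T, 15)
            c_win = one_hot(chords[-T:], 6).unsqueeze(0)          # (1, T, 6)

            # Chords: query the model only every 8 semiquavers, otherwise inherit.
            if i % 8 == 0:
                next_chord = chord_model(c_win, m_win).argmax(dim=-1).item()
            else:
                next_chord = chords[-1].item()

            # Melody: the rhythm of the given melody is imposed through the two masks.
            if note_mask[i] == 0:
                next_note = 0                                     # rest
            elif shift_mask[i] == 0:
                next_note = melody[-1].item()                     # hold the previous note
            else:
                cur = one_hot(torch.tensor([next_chord]), 6)
                logits = melody_model(c_win, m_win, cur)[0]
                logits[0] = float("-inf")                         # forbid a rest here
                logits[melody[-1]] = float("-inf")                # forbid repeating the note
                top = torch.topk(logits, k)                       # top-k sampling
                probs = F.softmax(top.values, dim=-1)
                next_note = top.indices[torch.multinomial(probs, 1)].item()

            melody = torch.cat([melody, torch.tensor([next_note])])
            chords = torch.cat([chords, torch.tensor([next_chord])])
        return melody[T:], chords[T:]   # the generated continuation of length T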
IV. RESULTS

The performance of the chord model is not ideal, while the melody model performs relatively satisfactorily. Evaluating the task and our approaches as a whole, we realize that one reason the chord model does not work well lies in the task itself. In songwriting, it is common for the chord progression to be fixed before the melody is written. In our setup, however, at every timestamp the model predicts the current chord based on both the previous melody and the previous chords, which introduces much instability into the generating process. What is worse, once the model has made a bad prediction, it has a huge negative influence on the following predictions.

The melody model turns out to perform relatively well. Since we use top-k sampling, some randomness is introduced into the generating process, which is helpful since we need considerable variation in a piece of melody. In most cases, we derive a new piece of melody of length T (following the given melody of length T) that sounds smooth and fits the chord progression. In many cases, however, the generated melody is relatively messy compared with the melody lines in our data. The generated melody still lacks the sense of "phrases": although the melody notes indeed fit the chords, they do not make much sense as a whole. In songwriting, besides variation, another important feature is repetition. An ideal case would be a melody consisting of a few phrases that have some variation but still sound similar and related. There were a few times in our experiments when the model actually generated a piece of melody that has this sense of "phrases" to some extent. We picked one of these melody lines, did a basic arrangement in FL Studio 20, and derived a music piece to be shown in the final presentation. Also, part of the melody lines are shown in Fig. 5: the first line is the given melody and the second line is the generated melody. The given melody comes from the song "Scared to Be Lonely" by Martin Garrix & Dua Lipa.

Fig. 5. Example of given melody and generated melody

V. FUTURE WORK

There are two novel ideas provided by Professor Gus Xia at NYU Shanghai, which we sincerely appreciate. Firstly, when predicting the melody notes, we can introduce a hierarchical learning mechanism: first predict whether the current note is a chord tone or not, then predict its pitch based on this feature. Secondly, for chord prediction, we can introduce an attention mechanism between the given chord embeddings and the predicted chord embedding, to make the piece of music sound better as a whole.

ACKNOWLEDGMENT

This project was tutored by Professor Guo Li at NYU Shanghai. We thank Professor Gus Xia for his advice on the feasibility of our proposal, without which we might not have been able to take the first step of this project. We also appreciate his idea of introducing an attention mechanism between the melody and chords, which is of great research significance.

REFERENCES

[1] Z. Wang*, K. Chen*, J. Jiang, Y. Zhang, M. Xu, S. Dai, G. Bin, and G. Xia, "POP909: A pop-song dataset for music arrangement generation," in Proceedings of the 21st International Society for Music Information Retrieval Conference, ISMIR, 2020.
[2] C. Raffel and D. P. W. Ellis, "Intuitive analysis, creation and manipulation of MIDI data with pretty_midi," in Proceedings of the 15th International Society for Music Information Retrieval Conference, Late Breaking and Demo Papers, ISMIR, 2014.
[3] C. Olah, "Understanding LSTM Networks," Aug. 2015. Available at https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," CoRR, vol. abs/1706.03762, 2017.