1 Lead Sheets
Lead sheets are widely used to represent the fundamental musical information of almost any contemporary song: they contain a chord scheme, a melody line, navigation and repetition markers, and sometimes lyrics. They seldom contain information about the instrumentation or accompaniment, so any band can take a lead sheet as a guideline and make the song their own, sometimes even by improvising over the chord scheme. In this paper we focus on generating chords and melody lines for lead sheets from scratch.
A major difficulty in music generation is that harmony, melody and rhythm all influence each other. For example, a melody note can change whenever the underlying harmony changes, and vice versa. Rhythmic patterns can influence which notes are played, and rhythm and harmony together define the overall groove of the piece. To tackle this issue, we split the generation process into two stages. First, we generate a harmonic progression as a chord sequence, while simultaneously picking the most appropriate rhythmic patterns. In a second step, the melody is generated on top of this harmonic and rhythmic template.
We are, however, not the first to tackle the problem of lead sheet generation and, more generally, music generation. Briot et al. provide a recent and extensive overview of deep learning based techniques in this field [1]. Regarding lead sheets specifically, Liu et al. use GAN-based models on piano roll representations, but the melody and chords are still predicted independently by different generators [9]. Roy et al. devise a lead sheet generator with user constraints defined through Markov models, in which harmonic synchronization between melody and chords is achieved by a probabilistic model that encodes which melody notes fit which chords [11]. There have also been many efforts in the past that learn to generate chords for a given melody [3, 8, 10], or the other way around [12]. In this paper we show that, on the one hand, a two-stage generation process greatly improves the perceived quality of the generated music, and, on the other hand, that melodic coherence improves when the melody generator can look ahead at the entire harmonic template of the song.
We represent a lead sheet as a sequence $x_{1:n}$ that decomposes into a chord sequence $c_{1:n}$, a rhythm sequence $r_{1:n}$ and a melody sequence $m_{1:n}$:

$$x_{1:n} = \left\{ c_{1:n}, r_{1:n}, m_{1:n} \right\}, \quad x_i = \left\{ c_i, r_i, m_i \right\}.$$

Throughout the paper, $x_{j:k}$ denotes the subsequence $\left( x_j, x_{j+1}, \dots, x_k \right)$. Figure 1 shows an example of this decomposition.
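As a small illustration of this decomposition, the sketch below shows how a single symbol $x_i$ could be represented in code. The paper does not provide an implementation; Python is our choice, and the class and field names are placeholders.

```python
# Hypothetical representation of a lead sheet symbol x_i = {c_i, r_i, m_i};
# names and types are illustrative assumptions, not taken from the paper.
from typing import List, NamedTuple, Optional

class LeadSheetSymbol(NamedTuple):
    chord: str             # c_i, e.g. "C:maj" or "A:min"
    rhythm: str            # r_i, e.g. "quarter", "eighth", or "barline"
    melody: Optional[int]  # m_i as a MIDI pitch, None for a rest

# A lead sheet x_{1:n} is then simply a list of such symbols.
LeadSheet = List[LeadSheetSymbol]
```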
![../images/496776_1_En_38_Chapter/496776_1_En_38_Fig1_HTML.png](../images/496776_1_En_38_Chapter/496776_1_En_38_Fig1_HTML.png)
Fig. 1. Example of a lead sheet decomposition into chords, rhythms and melodies.
2 The Wikifonia Dataset
In this paper we make use of the Wikifonia dataset, a former public lead sheet repository hosted at wikifonia.org. It contains more than 6,500 lead sheets in MusicXML format, covering a wide variety of modern genres. This section describes the preprocessing and encoding steps we apply to the dataset in order to obtain a clean collection of lead sheets.
2.1 Preprocessing
Eliminate Polyphony. Whenever multiple notes sound at the same time, we only retain the note with the highest pitch, as it is often the note that characterizes the melody.
Ignore Ties. Ties between two notes of the same pitch, which extend the first note’s duration, are ignored; the two notes are treated as separate notes with their original durations.
Delete Anacruses. Incomplete bars, which often appear at the start of a piece, are removed from all lead sheets.
Unfold Repetitions. Lead sheets can contain repetition and other navigation markers. If a section should be repeated, we duplicate that particular section, thereby unfolding the piece into a single linear sequence.
Remove Ornaments. Musical ornaments, such as grace notes, do not contribute much to the overall melody, so we leave them out.
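These preprocessing steps could, for example, be implemented with the music21 library. The sketch below is a simplified illustration under our own assumptions (the function name is ours, and the handling of ties and anacruses is omitted for brevity); it is not the authors' code.

```python
# Simplified preprocessing sketch using music21 (assumed available).
from music21 import converter, chord, harmony, note

def preprocess_lead_sheet(path):
    score = converter.parse(path)      # load a MusicXML lead sheet
    score = score.expandRepeats()      # unfold repetition and navigation markers
    for elem in list(score.recurse().notes):
        # Chord symbols carry the harmonic annotation and are kept as-is.
        if isinstance(elem, harmony.ChordSymbol):
            continue
        if isinstance(elem, chord.Chord):
            # Eliminate polyphony: keep only the highest-pitched note.
            top = max(elem.pitches, key=lambda p: p.ps)
            replacement = note.Note(top, quarterLength=elem.quarterLength)
            elem.activeSite.replace(elem, replacement)
        elif elem.duration.isGrace:
            # Remove ornaments such as grace notes.
            elem.activeSite.remove(elem)
    return score
```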
2.2 Data Encoding and Features
After preprocessing, we encode the melody, rhythm and chord symbols into feature vectors such that they can be used as input to our generators.
Encoding Rhythms. We retain the 12 most common rhythm types in the dataset, which are listed in Appendix B, and remove the 184 lead sheets that contain any other rhythm type. Together with a representation for the barline, we encode the rhythm into a 13-dimensional one-hot vector.
Encoding Chords. A chord is described by both its root and its mode. There are 12 possible roots (C, C♯, D, D♯, ..., B); we convert all accidentals to either no alteration or a single sharp. We count 47 different modes in the dataset, which we map to one of the following four: major, minor, diminished or augmented. This mapping only slightly reduces musical expressivity; the mapping table can be found in Appendix A. The 12 roots and 4 modes give 48 chord options in total, resulting in a 49-dimensional one-hot vector if we include the barline.
Encoding Melody. The MIDI standard defines 128 possible pitches. We assign two additional dimensions for rests and barlines, resulting in a 130-dimensional one-hot encoded melody vector.
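As a rough sketch of these encodings (assuming NumPy; the ordering of categories within each vector is our own assumption and may differ from the authors' implementation):

```python
# Illustrative one-hot encodings; dimensionalities follow the text,
# index conventions are assumptions.
import numpy as np

N_RHYTHM = 12 + 1      # 12 rhythm types + barline          -> 13 dimensions
N_CHORD  = 12 * 4 + 1  # 12 roots x 4 modes + barline       -> 49 dimensions
N_MELODY = 128 + 2     # 128 MIDI pitches + rest + barline  -> 130 dimensions

def one_hot(index: int, size: int) -> np.ndarray:
    vec = np.zeros(size, dtype=np.float32)
    vec[index] = 1.0
    return vec

def encode_chord(root: int, mode: int) -> np.ndarray:
    """root in 0..11 (C=0, C#=1, ...), mode in 0..3 (major, minor, dim, aug)."""
    return one_hot(root * 4 + mode, N_CHORD)

def encode_melody(midi_pitch=None, is_rest=False, is_barline=False) -> np.ndarray:
    if is_barline:
        return one_hot(N_MELODY - 1, N_MELODY)
    if is_rest:
        return one_hot(N_MELODY - 2, N_MELODY)
    return one_hot(midi_pitch, N_MELODY)
```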
3 Recurrent Neural Network Design
As mentioned in Sect. 1, the lead sheet generation process happens in two stages: in stage one the rhythm and chord template of the song is learned, and in stage two the melody notes are learned on top of that template. We will use separate LSTM-based models for both stages [2]; the models are trained independently of each other, but they are combined at inference time to generate an entire lead sheet from scratch. Figure 2 shows the complete architecture.
Stage One. In this stage, the rhythm and chord vectors are first concatenated and subsequently fed to two LSTM layers followed by a dense layer. All LSTM layers have an output dimensionality of 512 states, as indicated in the figure. The output of the dense layer is split into two vectors, to each of which we apply a softmax nonlinearity with a temperature parameter that controls the concentration of the output distribution. This way we effectively model a distribution over the chord and rhythm symbols that come next in the sequence.
![../images/496776_1_En_38_Chapter/496776_1_En_38_Fig2_HTML.png](../images/496776_1_En_38_Chapter/496776_1_En_38_Fig2_HTML.png)
Fig. 2. The RNN architecture for both stage one (left) and stage two (right). The output dimensionality of every layer is indicated in each block. Whenever two blocks appear next to each other, their (output) vectors are concatenated.
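The stage-one network could be sketched as follows. The paper does not specify an implementation framework; PyTorch is our choice here, the layer sizes follow the description above, and everything else (names, defaults) is an assumption.

```python
# Sketch of the stage-one network: two LSTM layers over concatenated chord and
# rhythm vectors, a dense layer, and two temperature softmax outputs.
import torch
import torch.nn as nn

class StageOneModel(nn.Module):
    def __init__(self, chord_dim=49, rhythm_dim=13, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(chord_dim + rhythm_dim, hidden,
                            num_layers=2, batch_first=True)
        self.dense = nn.Linear(hidden, chord_dim + rhythm_dim)
        self.chord_dim = chord_dim

    def forward(self, chords, rhythms, temperature=1.0):
        x = torch.cat([chords, rhythms], dim=-1)   # concatenate chord and rhythm vectors
        h, _ = self.lstm(x)                        # two LSTM layers with 512 states
        out = self.dense(h)                        # dense layer over both outputs
        chord_logits = out[..., :self.chord_dim]   # cut the output in two vectors
        rhythm_logits = out[..., self.chord_dim:]
        # Softmax with temperature controls the concentration of the distributions.
        return (torch.softmax(chord_logits / temperature, dim=-1),
                torch.softmax(rhythm_logits / temperature, dim=-1))
```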
3.1 Optimization Details
The stage-one model is trained by minimizing a weighted sum of the cross-entropy losses on the predicted chord and rhythm distributions $\hat{\varvec{c}}$ and $\hat{\varvec{r}}$, with weighting parameter $\alpha$:

$$\mathcal{L}_{\mathrm{stage\,1}}\!\left( \hat{\varvec{c}}, \hat{\varvec{r}}\right) = \alpha \cdot \mathcal{L}_{\mathrm{CE}}\!\left( \hat{\varvec{c}}\right) + (1 - \alpha)\cdot \mathcal{L}_{\mathrm{CE}}\!\left( \hat{\varvec{r}}\right),$$

in which $\mathcal{L}_{\mathrm{CE}}\!\left( \cdot \right)$ denotes the cross-entropy loss. The stage-two model is trained with the cross-entropy loss on the predicted melody distribution $\hat{\varvec{m}}$:

$$\mathcal{L}_{\mathrm{stage\,2}}\!\left( \hat{\varvec{m}}\right) = \mathcal{L}_{\mathrm{CE}}\!\left( \hat{\varvec{m}}\right).$$
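A minimal sketch of the stage-one loss, again assuming PyTorch and assuming that the network outputs unnormalized logits:

```python
# Weighted cross-entropy loss for stage one, following the equation above.
# Inputs are assumed to be logits of shape (batch, time, classes) and
# integer class targets of shape (batch, time).
import torch.nn.functional as F

def stage_one_loss(chord_logits, rhythm_logits,
                   chord_targets, rhythm_targets, alpha=0.5):
    loss_c = F.cross_entropy(chord_logits.transpose(1, 2), chord_targets)
    loss_r = F.cross_entropy(rhythm_logits.transpose(1, 2), rhythm_targets)
    return alpha * loss_c + (1.0 - alpha) * loss_r
```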
4 Experiments
Hyperparameters. In all experiments we use a batch size of 128 sequences, each of length 100. The learning rate is fixed at 0.001. We empirically found that a value of 0.5 for the loss weight $\alpha$ leads to good results, so we keep it fixed at that value. During inference of the melody we set the temperature slightly lower than 1, which helps to improve the perceived quality of the generated music; we varied it between 0.75 and 1.0 in the experiments. For the rhythm and chord patterns, a temperature of 1.0 gives the most pleasing results.
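As a generic illustration of temperature sampling at inference time (our own sketch, not the authors' code), a temperature below 1 sharpens the output distribution before a symbol is drawn:

```python
# Draw a symbol index from a probability vector after temperature scaling.
import numpy as np

def sample_with_temperature(probs, temperature=0.85, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.log(np.asarray(probs) + 1e-9) / temperature
    scaled = np.exp(logits - logits.max())
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))
```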
Baseline Models. We compare our approach against two baseline models:
1. An unconditioned LSTM-based model, similar to the stage-one model in Fig. 2, but with the melody also concatenated to the input and output. The melody is thus no longer conditioned on the entire chord and rhythm sequence. We also add an extra LSTM layer, bringing the total to three.
2. A two-stage model in which the BiLSTM layers are replaced by regular LSTM layers, so that the melody generator cannot look ahead at the harmonic sequence. All other parameters are kept identical to the original model.
Subjective Listening Test. We set up a subjective listening test comprising the following twelve pieces:
- 3 pieces generated by the two-stage model from scratch,
- 3 pieces generated by the two-stage model, but conditioned on the chord and rhythm scheme of existing songs: I Have a Dream (Abba), Autumn Leaves (jazz standard) and Colors of the Wind (Alan Menken),
- 2 pieces generated by the one-stage baseline model,
- 2 pieces generated by the two-stage baseline model,
- 2 (relatively unknown) human-composed songs: You Belong to my Heart (Bing Crosby) and One Small Photograph (Kevin Shegog).
As stated in Sect. 1, a lead sheet only encodes the basic template of a song, and it ideally needs to be played by a real musician. We therefore gave all lead sheets to a semi-professional pianist; the pianist stayed true to the sheet music, but was free to create an accompaniment that suited the piece. In our view, this evaluation method best reflects how a lead sheet produced by an AI model would be used and experienced by musicians and listeners in practice.
Since every participant uses the rating scale differently, we normalize the ratings per participant:

$$Z_{c,u} = \frac{R_{c,u} - \mu_u}{\max_{c'} R_{c',u} - \min_{c'} R_{c',u}},$$

in which $R_{c,u}$ is the rating that participant $u$ gave to piece $c$ and $\mu_u$ is the mean rating of participant $u$. The normalized scores $Z_{c,u}$ are centered around zero and typically lie between $-0.5$ and $0.5$; we report the averaged scores $\bar{Z}$ for each model in the table below.
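The normalization can be computed directly from a rating matrix, as in the following sketch (NumPy assumed; the data layout is hypothetical):

```python
# Per-participant normalization of ratings: rows are pieces, columns are users.
import numpy as np

def normalize_ratings(R):
    R = np.asarray(R, dtype=float)
    mu = R.mean(axis=0)                     # mean rating of each user
    spread = R.max(axis=0) - R.min(axis=0)  # rating range of each user
    return (R - mu) / spread                # Z_{c,u}
```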
We observe that the scores for the two-stage models are far better than for the unconditioned one-stage model. This shows that first sampling a harmonic and rhythmic sequence, and conditioning the melody on this sequence, is more beneficial than sampling all quantities simultaneously. We also notice that adding the BiLSTM layers improves the score for all three questions; although the margin is small, we can conclude that the musical quality improves when the melody generator can look ahead in the harmonic sequence. When we condition the melody generator on an existing chord and rhythm scheme, it is remarkable that the human-composed and AI-composed songs perform almost on par; the AI-composed songs are even considered the most pleasing. Related to this observation, 4 participants indicated having recognized a piece from the two-stage model, 5 recognized a piece that was generated based on existing chords, and 3 recognized a human-composed song.
Results of the subjective listening experiments. We report averaged $\bar{Z}$-scores for each of the questions, along with the standard deviations.
| Model | Pleasing | Coherence | Turing |
| --- | --- | --- | --- |
| One-stage | | | |
| Two-stage, without BiLSTM | | | |
| Two-stage, with BiLSTM | | | |
| Two-stage, with existing chords | | | |
| Human-composed songs | | | |
5 Conclusion
We have proposed a two-stage LSTM-based model to generate lead sheets from scratch. In the first stage, a sequence of chords and rhythm patterns is generated; in the second stage, the sequence of melody notes is generated conditioned on the output of the first stage. We conducted a subjective listening test whose results show that our approach outperforms the baselines. We can therefore conclude that conditioning improves the quality of the generated music, and that this approach is worth exploring further in the future.