17 Project Magenta: AI Creates Its Own Music
Can we use machine learning to create compelling art and music? If so, how? If not, why not?
—Douglas Eck5
As head of Google’s Project Magenta, Douglas Eck has found a way to combine his two loves—AI and music—in his work. An open and outspoken man, he first took an interest in computers at the age of thirteen. “I was from the floppy disc era,” he says, “but never thought of myself as a computer person.” Nevertheless, he became adept at coding.
Like many people working in AI and the arts, Eck has had an extraordinary career trajectory. As an undergraduate at Indiana University, he studied English literature and creative writing but couldn’t find good employment opportunities. Instead he became a programmer, a career he had never planned on. As a young database programmer in Albuquerque, New Mexico, he became interested in the creative possibilities of coding. But he was also a musician. He enjoyed playing a sort of punk folk music on piano and guitar in coffee houses there. At twenty-four, he found himself bored with database programming and in love with music. He asked himself, “What’s at the intersection of the two? Maybe I can put them together. So I said to myself, let’s do AI and music.”6
Eck had no training in AI except for being a self-taught coder, and in 1994 he applied to Indiana University to study computer science and cognitive science. There he banged on the door of one of the university’s stellar professors, Douglas Hofstadter, author of Gödel, Escher, Bach, and asked to work with him. In retrospect, Eck says, this was an extremely unsuitable way to approach the great man. “But I just did it. This is the way I work.”7 After a year of bringing his mathematics up to date, Eck formally entered the graduate program—though, in the end, he did not work with Hofstadter because Hofstadter did not believe that AI was anywhere near sophisticated enough to produce innovative music. Hofstadter continues to maintain this view of AI to this day.8
At Indiana, Eck worked with researchers in cognitive science, studying the use of electronic devices to explore how we sense rhythm, what makes us tap our feet to the beat. “I just did it,” he says—his catchphrase. “I just started working on musical cognition and computing and never looked back.”9
At the Dalle Molle Institute for Artificial Intelligence (IDSIA) in Lugano, Switzerland, he worked on how to compose music on a neural network that had been trained on a form of twelve-bar blues popular among bebop jazz musicians.10 In music, as in literature, there is a narrative structure, a beginning, a middle, and an end in which themes are reprised. Progress is not relentlessly linear. Machines have problems with this because many are built with an architecture that makes it difficult for them to remember what went before. Eck addressed himself to this problem.
His research was built on recurrent neural networks (RNNs), developed in the 1980s when neural networks were in their infancy. In the usual feed-forward process for solving a problem, like composing music, information always moves on to the next step; in an RNN, the output of each step is also fed back into the network in a feedback loop. This meant the networks had a sort of memory: past decisions could affect their response to new data. It also avoided the restriction of Markov chains, which move in one direction only: forward.
But RNNs could easily lose their knowledge of the past if the weights (the strengths of the connections between neurons) were not perfectly adjusted in the feedback loops. If the weights were set too high, the signals circling through the loops blew up and the learning ran riot; if too low, the signals faded away and the RNN never remembered anything really interesting.
In 1997, Sepp Hochreiter and Jürgen Schmidhuber invented long short-term memory (LSTM).11 LSTM protects RNNs against memory loss by gating off the results they accrue and keeping them in a memory cache, much as we hold intermediate results in mind while working through a problem.
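In rough outline, the gating idea can be sketched in a few lines of Python. This is a minimal illustration of a single LSTM memory cell in its modern form (with the forget gate that was added after the 1997 paper), not the network Eck would later use; the weight matrices stand in for whatever the network learns during training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of an LSTM memory cell.

    x       -- the current input (say, the current note)
    h_prev  -- the cell's previous output (its short-term memory)
    c_prev  -- the cell state (the long-term memory cache)
    W, U, b -- learned weights for the four gates, stacked row-wise
    """
    z = W @ x + U @ h_prev + b                    # pre-activations for all four gates
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, and output gates
    g = np.tanh(g)                                # candidate new content to store
    c = f * c_prev + i * g                        # keep some old memory, admit some new
    h = o * np.tanh(c)                            # reveal part of the memory as output
    return h, c
```

The gates are what does the "gating off": the forget gate decides how much of the old cell state to keep, the input gate how much new information to store, and the output gate how much of the stored memory to expose at each step.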
In 2002, Eck and Schmidhuber built on these early efforts, replacing the standard RNN units with LSTM memory cells. The aim was to set up a stable music generator. But at the time, neural networks didn’t have enough hidden layers, and there wasn’t enough data available to train them. As a result, the machine produced either monotonous riffs or jumbles of notes.
Eck joined Google in 2010. He was eager to tackle the challenge of combining AI and music. Other researchers, such as Leon Gatys with his style transfer, were combining machine learning with art. François Pachet, whom Eck greatly admired, was doing the same with AI and music. “People have carved out their own spaces. There’s plenty of room for all of us,” he recalls thinking.12 He was encouraged by the possibilities he saw for machines to generate art and music using deep learning. Google was in a unique position to do this using Google Brain, its deep learning AI research project, initiated in 2011 in Mountain View, California. (Deep learning machines learn directly from data, such as images or moves in the game of Go, without needing to be programmed with task-specific rules.)
Eck made a proposal to Google for what he called Project Magenta, with himself at the helm, to tackle these questions: “Can we use machine learning to create compelling art and music? If so, how? If not, why not?”13
Creativity can only flourish when there are boundaries, guidelines, limiting factors: constraints. You can’t create an advertisement for shoes without defining the kind of shoe, the price range, and the target market. Eck’s constraint was to couple Magenta to Google Brain and its deep learning facility, applying machine learning with deep neural networks end to end; in other words, the machine teaches itself at every step of the process.14 This minimalist approach had been immensely successful in Google’s work on recognizing images and, a few years later, on translating languages and analyzing sound.
At Magenta, everything they used was open source—including TensorFlow, Google’s extensive software library on GitHub, a giant coding and software-development platform. Says Eck, Magenta is “the glue between TensorFlow and the artistic and the music community.” It brings together coders, artists, and musicians to build models to generate sound, images, and words. For the moment, Eck is focusing on music. “My whole career has been about music and audio,” he says.15 But, he insists, the scope of Magenta has always encompassed art and storytelling as well as music.
For now, the goal of Magenta is to provide a creative feedback loop for artists, musicians, and computers. Says Eck, “Art and technology have always coevolved.”16 He cites guitarists Les Paul and Jimi Hendrix in the 1950s and 1960s. Les Paul, an American jazz, country, and blues guitarist, was one of the pioneers of the solid-body electric guitar and the inventor of the Les Paul Standard, the Holy Grail of electric guitars. His guitars have an extraordinary sweetness of sound, as anyone who has ever played one, including Keith Richards and other legendary guitarists, will attest.
Hendrix took the Les Paul guitar and pushed its boundaries, amplifying it so much that it produced extreme distortion, taking the sound into another realm. In the same way, photography was initially developed to capture physical reality, but photographers soon put paid to that with intentional distortions, overexposed film, and other effects, subverting the instrument’s original purpose just as Hendrix had.
To Eck, the engineers at Magenta are more Les Paul and the artists more Jimi Hendrix. The essence of Magenta’s existence is to be a creative feedback loop, a push and pull, between us and AI, increasing our human creativity. At present, fully autonomous creative machines are not on Eck’s radar. He doesn’t want to step back and “watch a machine create art,” he says. Rather his aim is to “increase our creativity as people [with] a cool piece of technology to work with.”17
As soon as Magenta was up and running, Eck set to work to bring his first project to life: to enable a machine to compose. The result was the small melody described at the beginning of this part, the first music composed entirely by a computer without being in any way programmed to do so. As described previously, Eck started by feeding in 4,500 popular tunes, from which the computer learned the rules for writing this particular sort of music. Then he seeded the computer with four musical notes—two Cs and two Gs. The computer then produced a ninety-second melody played on a Musical Instrument Digital Interface (MIDI) synthesizer (linking an electronic instrument and a computer), with percussion added to give it rhythm.
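The generation loop itself is simple to sketch. The snippet below is purely illustrative: the model object, its predict_next_note method, and the note count are assumptions rather than Magenta’s actual code, but they show the idea of seeding a trained network with a short primer and letting it predict one note at a time.

```python
# Two Cs and two Gs as MIDI pitch numbers (middle C = 60, the G above it = 67).
PRIMER = [60, 60, 67, 67]

def generate_melody(model, primer=PRIMER, total_notes=200):
    """Grow a melody one note at a time from a short primer."""
    melody = list(primer)
    while len(melody) < total_notes:
        # The trained model (hypothetical here) returns its guess for the
        # next note, conditioned on everything generated so far.
        melody.append(model.predict_next_note(melody))
    return melody
```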
Eck was rather horrified when the little melody was snapped up by the media and proclaimed the world’s first machine-composed melody. In fact, it created a sensation. Nick Patch of the Toronto Star wrote that the “age of the computer composer is finally upon us—even if the first machine-written tune sounds like a phoned-in Sega Genesis score.”18 The Star even invited three musicians and a singer to improvise on the melody in their own styles.
Ryan Keberle, a music professor in New York, complained that he “found [the tune] to be lacking in musical depth.” Hardly surprising, but indicative of the seriousness with which many people took it. And although it was a computer-produced melody, it was not as simple as it might at first sound: the LSTM’s memory of what it had already played made subtle variations possible.
There were many things wrong with this first musical venture. Says Eck, the machine “doesn’t really understand musical time.” It played only quarter notes and was not “attention-based.” In other words, rather than being able to isolate a particular part of a musical score and focus only on that, the neural net spread its attention over different parts of the score, as well as over everything going on around it.
Eck was also aware that the tune it had composed entirely lacked expressiveness: it was totally flat. To highlight what was missing, he fed a MIDI score of Beethoven’s Moonlight Sonata into a computer and hooked it up to an electronic piano, which gave a robotic performance. (A MIDI device can also output a musical score from a computer.) It was note-perfect, but totally lacking in feeling. Then a pianist sat down at the same piano and played the same piece from the same MIDI score. But he naturally modulated the pressure on the keys, used the pedals, and brought in his own personality and feelings. As Eck puts it, “The performer is telling us a story that is not in the score. … The story emerges from a combination of performer, score and real world.”19
The next challenge, then, was to find a way to develop a machine that could create more expressive music. Two engineers at Magenta, Ian Simon and Sageev Oore, the latter of whom is a classical pianist and jazz musician, worked on this. They developed Performance RNN, “an LSTM recurrent neural network designed to model polyphonic music with expressive timing and dynamics.”20
To create the Performance RNN version of the Moonlight Sonata, they trained their neural network on the Yamaha Piano-e-Competition dataset, feeding in some 1,400 performances by skilled pianists, captured on a MIDI device. Oore played a keyboard attached to a MIDI interface that converted his playing into data (numbers), which formed the raw material of the algorithm. In MIDI there are note-on and note-off events: when the pianist presses a key and when he releases it. The MIDI data also recorded each note’s pitch and its velocity, that is, how hard the key was struck, which determines how loud the note sounds.
From this, Simon and Oore generated sequences of bars, complete with notes and dynamics. Then they expanded their training set by varying the initial data, changing the length of each performance slightly and transposing the pitch by up to a major third, for example from C up to E.
Using this data, they set up their machine to generate music in steps of ten milliseconds, which allowed for more expressive timing and a mixture of quarter and sixteenth notes to give variety and meaning. This fine time grid permitted many more notes per second, producing the high density of notes appropriate to a musical score, as opposed to the primitive structure of the earlier melody. As the performances became longer, an LSTM stored what had gone before.
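A rough sketch of how a recorded performance can be turned into such a stream of events is shown below. The event names and the exact bookkeeping are illustrative assumptions, but the idea follows the description above: note-on and note-off events carry the pitch, velocity events record how hard a key was struck, and time-shift events advance the clock in ten-millisecond steps.

```python
def encode_performance(notes):
    """notes: list of (start_sec, end_sec, pitch, velocity) tuples, one per key press."""
    boundaries = []
    for start, end, pitch, velocity in notes:
        boundaries.append((start, "velocity", velocity))
        boundaries.append((start, "note_on", pitch))
        boundaries.append((end, "note_off", pitch))
    # At any given instant, release old notes first, then set the new
    # velocity, then sound the new note.
    priority = {"note_off": 0, "velocity": 1, "note_on": 2}
    boundaries.sort(key=lambda e: (e[0], priority[e[1]]))

    events, clock = [], 0.0
    for time, kind, value in boundaries:
        steps = round((time - clock) / 0.01)   # how many 10 ms ticks to advance
        if steps > 0:
            events.append(("time_shift", steps))
            clock += steps * 0.01
        events.append((kind, value))
    return events
```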
To control the model’s output, Simon and Oore used a parameter they called temperature, which determines how random the machine’s output is. At a temperature of one, the machine samples notes according to the probabilities it has learned from its training data. Lowering the temperature reduces the amount of randomness, but this tends to result in repetition. Raising it results in random bursts of notes.
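Temperature is easy to make concrete. In the sketch below (a standard sampling trick, assuming the model outputs a score, or logit, for each possible next event), the scores are divided by the temperature before being turned into probabilities: dividing by a small number exaggerates the differences between them, so the safest choices dominate, while dividing by a large number flattens them toward pure chance.

```python
import numpy as np

def sample_next_event(logits, temperature=1.0):
    """Pick the next musical event from the model's scores."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())   # softmax, kept numerically stable
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```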
In the end, the music generated was only interesting for thirty seconds or so and then began to lose long-term structure. Nevertheless, this minimalist approach was a step in the right direction: to create a neural network trained entirely on a database and not programmed to do anything in particular—trained end to end, in other words. To Eck, the result was exciting. The machine wasn’t just generating notes: it was making decisions, “deciding how fast [the notes were] being played, and how loudly.”21
Recently Eck and his colleagues have been looking for a way to generate longer sequences of notes. At the end of 2018 they brought out the Music Transformer. Eck describes it as “a genuine breakthrough in the neural network composition of music.”22
Performance RNN not only degrades after a few seconds, it also forgets the input motif almost immediately. The Music Transformer, on the other hand, keeps the input motif intact while playing it against the background of the music it has been trained on.23 This is closer to what musicians actually do. In fact, the Music Transformer produces coherent music for more than twice as long as Performance RNN. In technical terms, an RNN has to squeeze everything it has heard so far into a fixed-size memory, so it can draw on only fragments of what came before, while the Music Transformer can attend directly to every earlier note in the piece, all the while keeping the motif in view. Eck and his colleagues see the Music Transformer as a way for musicians to explore a range of possibilities for developing a given motif. An inspiring example is the way in which the Music Transformer plays with the motif in Chopin’s Black-Key Étude, keeping it intact while improvising around it.
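The heart of the difference is attention. The sketch below shows the attention mechanism in its most stripped-down form: every note’s vector is compared against every earlier note’s vector, and the results are used to build a context-aware summary of the piece so far. The real Music Transformer adds learned projections and relative-position information, so this is only an illustration of the principle, not its actual implementation.

```python
import numpy as np

def self_attention(X):
    """X: one vector per note generated so far, stacked into an (n, d) array."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)                 # how relevant is note j to note i?
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf                        # generation may only look backward
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X                            # each note's new, context-aware summary
```

Because every new note can look directly back at the opening motif, the motif never has to survive a long chain of fading memories, which is why the Transformer can keep it intact over much longer stretches.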
As head of Project Magenta, creativity is always on Eck’s mind. I ask him about volition, whether machines can create works of art on their own. His first response is, “I predicted this question on my bike ride in this morning. It’s the right question. This is the sort of thing I think about when I’m falling asleep or when I have a scotch.”24
The present political climate has made him think about dystopian novels like George Orwell’s Animal Farm and 1984, he continues. How close are we to developing a computer capable of producing works like these? It seems impossibly far off, he concedes. “But who knows?” He doesn’t know when machines will have “what looks to us to be actual volition.” And when they do, people will question whether it truly is volition, to which the only answer will be the one that David Ferrucci gave when asked whether IBM Watson could think: “Can a submarine swim?” Whenever machines show glimmers of creativity, people raise the bar of what should be defined as creative.
It will take a generational shift to change this, says Eck. Today we have virtual pop stars like Hatsune Miku in Japan, a Vocaloid hologram who has thousands of fans. Young people are less hidebound about the concept of creativity; indeed, her computer nature is what gives Miku her popularity.
Eck is encouraged by the way Go enthusiasts respond to AlphaGo and its unpredictable moves and in particular to the fact that the Go community refers to the machine as “she” and described AlphaGo’s key move in the second game of its match with Lee Se-dol as “beautiful.” They didn’t get caught in the trap of saying, “Well, if it were made by a human it would have been beautiful.” Here, says Eck, “we can clearly attribute beauty to the machine.”25
Eck believes that imposing boundaries on computers when they compose music will help to ensure they create works that we can appreciate and enjoy. He gives as an example serialism, which Arnold Schoenberg initiated early in the twentieth century with his twelve-tone technique, based on the twelve notes of the chromatic scale (each a semitone above or below its adjacent pitches), a limitation that conversely created new freedoms of expression. It was, says Eck, an interesting way to compose music, but people were simply not equipped to appreciate it. From this he concludes that there are limits to the sounds we can appreciate. “All happens in small leaps. You can’t move from Mozart to the Sex Pistols in one generation.” The music generated by creative machines, he suggests, will be of a kind so different from what we can even imagine—owing to our own cognitive limitations—that people will find it cacophonous, as happened at first with Schoenberg’s serial music, until he modified it.
Another example Eck gives of the necessity of imposing limitations, or constraints, on creativity is the way in which he formed Project Magenta. “When I created Magenta, it was clear there had to be some constraining factors. Otherwise you’re all over the place. For me the constraining factor was that I believed that deep learning, explicitly, has been a very productive and interesting revolution in machine learning, that would tie deep neural networks in with RNNs and reinforcement learning,” he tells me. In other words, he had a definite plan in mind, rather than simply taking account of everything in AI. Thus inspired, he worked with engineers from Google Brain to create Magenta and tied it in with TensorFlow, Google’s software library and machine-learning framework. He also chose to utilize end-to-end learning, in which the computer essentially teaches itself by being fed data, with no further outside input. This is the basis of Google’s very successful translation process, which took twenty years to work out. “Finally we got it right,” says Eck. “That’s what we’re trying to do with music.”