AI & Music: Using GPT-3 As A Drum Machine! 🥁
GPT-3's Language-To-Music Knowledge Transfer: Exciting New Paper

In an exciting new paper, Li Zhang & Chris Callison-Burch showcase how language models like OpenAI’s GPT-3 can be fine-tuned to [drumroll 🥁] …
… act as drum machines. 🤯
In this post, we’ll find out how they did it, how to replicate it, and how my experiment in turning GPT-3 into a Middle Eastern drum machine went. Buckle up, this is exciting new stuff!
Teaching GPT-3 To Be A Drummer
In “Language Models Are Drummers”, Zhang and Callison-Burch present preliminary results on a method for automatic music generation using GPT-3.
Yes, that’s right: The same GPT-3 that everyone is raving about right now for its incredible ability to generate text is now taking the stage as a means of generating music.
Their method transfers GPT-3’s knowledge of language to music by fine-tuning the regular GPT-3 model on just a few hundred MIDI files.
Sounds intriguing, right?
Here’s their straightforward approach:
1. From Google’s Groove MIDI Dataset, a collection of 1,150 MIDI files and over 22,000 measures of drumming from 10 professional drummers, Zhang & Callison-Burch selected a few hundred grooves by style, length, and time signature (Western Rock/Pop, 16 measures, 4/4) for simplicity. MIDI (Musical Instrument Digital Interface) is a protocol standard that allows electronic musical instruments to connect and communicate with each other. Music in MIDI files is stored as note events (pitch, timing, and velocity) together with the instrument type, among other things, and the data is machine readable.

2. The MIDI data was then converted into a multi-line string in a so-called “drumroll format”, where a measure of music corresponds to 16 lines of text and each line corresponds to a 16th note. So, each of the selected 16-bar drum grooves is represented as text (the domain of GPT-3): 16 columns of 16 lines each, one column per measure.

3. Finally, GPT-3 was fine-tuned on the text data: the first two measures of each groove (2 columns of 16 lines of text) served as the prompt and the following fourteen measures (fourteen columns of 16 lines of text) as the desired completion.
And that was it.
The fine-tuned model was then able to take any given 2-bar prompt (presented to GPT-3 in the “drumroll” format) and turn it into a 16-bar drum groove.
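To make the three steps a bit more concrete, here is a minimal Python sketch of how a 16-bar groove could be turned into text and split into a 2-bar prompt and a 14-bar completion. Please treat it as an illustration only: the instrument list, the `o`/`-` symbols, and the function names are my own assumptions, since the exact characters of the paper's drumroll format are not spelled out in this post. Only the numbers (16 lines per measure, 2 prompt measures, 14 completion measures) come from the description above.

```python
import json

STEPS_PER_MEASURE = 16   # one text line per 16th note in 4/4
PROMPT_MEASURES = 2      # first two measures form the prompt
TOTAL_MEASURES = 16      # every selected groove is 16 measures long

# Hypothetical instrument order and symbols (not taken from the paper).
INSTRUMENTS = ["kick", "snare", "hihat"]


def measure_to_text(hits):
    """Render one measure as 16 lines; `hits` is a set of (step, instrument)."""
    lines = []
    for step in range(STEPS_PER_MEASURE):
        line = "".join("o" if (step, inst) in hits else "-" for inst in INSTRUMENTS)
        lines.append(line)
    return "\n".join(lines)


def groove_to_example(measures):
    """Split a 16-measure groove into a prompt/completion training pair."""
    if len(measures) != TOTAL_MEASURES:
        raise ValueError("expected exactly 16 measures")
    texts = [measure_to_text(m) for m in measures]
    prompt = "\n".join(texts[:PROMPT_MEASURES]) + "\n"
    completion = "\n".join(texts[PROMPT_MEASURES:]) + "\n"
    return {"prompt": prompt, "completion": completion}


# A made-up basic rock beat: hi-hat on every 8th note, kick on 1 and 3, snare on 2 and 4.
basic = {(s, "hihat") for s in range(0, 16, 2)}
basic |= {(0, "kick"), (8, "kick"), (4, "snare"), (12, "snare")}

example = groove_to_example([basic] * TOTAL_MEASURES)
print(measure_to_text(basic))
print(json.dumps(example)[:60], "...")  # one line of a JSONL training file
```

Writing one such JSON object per groove into a file gives a set of prompt/completion pairs, a shape commonly used for fine-tuning completion models.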
GPT-3 was not just copying its input but managed to create new grooves within the musical style it had been fine-tuned on. The latest Davinci model produced much better results than the cheaper and faster Ada model. That is pretty insane!

Of course, there are still some errors slipping into the resulting drum grooves that a professional human drummer would not make, but these can be fixed, Zhang & Callison-Burch argue, by further refinement of the fine-tuning method. Evaluating the strengths and weaknesses of their approach, they come to the conclusion that “language-to-music transfer learning with large language models is viable and promising”.
Experiment: GPT-3 As a Middle Eastern Drum Machine
Viable and promising is good enough for the ethnomusicologist in me, so I tried to tackle a specific case: fine-tuning GPT-3 to generate a popular Middle Eastern groove called “Semai Al Thaqil”, which sounds pretty foreign to the untrained ear since it is built on a ten-beat structure, not the Western standard of 4/4.
Here you’ll find an explanation of the rhythm:
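To get a feel for what a ten-beat cycle means for the text format, here is a rough Python sketch. It rests on two assumptions of mine that are not stated in the paper: that the ten beats are counted as eighth notes (10/8), and that the skeleton of the rhythm is the commonly taught dum/tek pattern written below. With one line per 16th note, a cycle then takes 20 lines instead of the 16 lines of a 4/4 measure.

```python
BEATS_PER_CYCLE = 10      # ten-beat structure of the rhythm
SIXTEENTHS_PER_BEAT = 2   # assumption: each beat is an eighth note (10/8)

# Assumed skeleton, one character per beat: D = dum, T = tek, - = rest.
SKELETON = "D--T-DDT--"


def cycle_to_lines(skeleton, steps_per_beat=SIXTEENTHS_PER_BEAT):
    """Expand a per-beat skeleton into one line per 16th note."""
    lines = []
    for stroke in skeleton:
        lines.append(stroke)                        # the stroke lands on the beat
        lines.extend(["-"] * (steps_per_beat - 1))  # remaining 16ths are rests
    return lines


lines = cycle_to_lines(SKELETON)
print(len(lines), "lines per cycle")
print("\n".join(lines))
```

So every fixed number from the 4/4 setup (16 lines per measure, and with it the prompt and completion lengths) has to be rethought for this groove.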






