Riffusion: The Stable Diffusion of Music


3/8/2023 · 3 min read

Riffusion - A Theme about Artificial Intelligence Replacing Musicians

In this post we explore Riffusion, an AI-powered tool that generates music from simple text prompts.

Unless you've been living under a rock, you've probably heard of ChatGPT, an AI chatbot that can generate coherent, human-like text from simple prompts.

But there are also other AI-powered "artists" out there, such as DALL-E and Stable Diffusion, which can turn text prompts into images.

Those of us who are passionate about sound and the music industry are used to being the ones late to the party, and many of us are already wondering: "AI is starting to create everything, but what about music?"

If text-to-image and text-to-speech software already exist, there should also be some "text-to-music" software, right?

It shouldn't surprise anyone by now, but yes, there is already software capable of generating music from text prompts, just as Stable Diffusion generates images from text prompts.

And it is conveniently named Riffusion.


What is Riffusion?

Riffusion is a groundbreaking neural network designed and developed by Seth Forsgren and Hayk Martiros that generates music by producing images of sound rather than audio directly.

Yes, you read that right, it generates music by generating images... of sound.

Let us explore this a little deeper:

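At its core, Riffusion is a fine-tuned version of Stable Diffusion, an open-source model for generating images from text prompts, except that it has been tuned to produce spectrograms. A spectrogram is a picture of sound: the horizontal axis is time, the vertical axis is frequency, and the brightness of each pixel shows how loud that frequency is at that moment.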
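To make the "images of sound" idea concrete, here is a minimal Python sketch of how a magnitude spectrogram (a grid of loudness values indexed by frequency and time) could be stored as an 8-bit grayscale image and read back. The decibel scaling, the `DB_FLOOR` value, and the function names `magnitudes_to_pixels` and `pixels_to_magnitudes` are hypothetical choices for illustration; Riffusion's own image format may scale values differently.

```python
import math

# Hypothetical scaling choice: anything quieter than 80 dB below the peak becomes black.
DB_FLOOR = -80.0


def magnitudes_to_pixels(spec):
    """Convert a magnitude spectrogram spec[freq_bin][time_frame] to 0-255 pixels.

    freq_bin 0 is the lowest frequency. Rows are reversed so the highest
    frequency becomes the top row of the image and the lowest frequency the
    bottom row; columns stay in time order. Returns the image plus the peak
    magnitude, which is needed to invert the mapping.
    """
    peak = max(max(row) for row in spec) or 1.0
    image = []
    for row in reversed(spec):
        pixels = []
        for m in row:
            db = 20 * math.log10(m / peak) if m > 0 else DB_FLOOR
            db = max(db, DB_FLOOR)  # clip very quiet values to the floor
            pixels.append(round(255 * (db - DB_FLOOR) / -DB_FLOOR))
        image.append(pixels)
    return image, peak


def pixels_to_magnitudes(image, peak):
    """Approximate inverse of magnitudes_to_pixels.

    The round trip is lossy: pixels are quantized to 256 levels, and values
    at the floor come back as a small non-zero magnitude rather than silence.
    """
    spec = []
    for pixels in reversed(image):  # undo the row flip
        spec.append([peak * 10 ** ((DB_FLOOR + p * -DB_FLOOR / 255) / 20) for p in pixels])
    return spec


if __name__ == "__main__":
    # Two frequency bins (low, high) by two time frames.
    spec = [[1.0, 0.1], [0.001, 0.0]]
    image, peak = magnitudes_to_pixels(spec)
    print(image)  # [[64, 0], [255, 191]]
    print(pixels_to_magnitudes(image, peak))
```

Storing loudness on a logarithmic (decibel) scale is a common choice because hearing is roughly logarithmic; a linear scale would leave most quiet detail as near-black pixels.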

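Riffusion applies the same image-generation principles used in Stable Diffusion, including its img2img functionality (generating a new image guided by a starting "seed" image), but focuses them on interpolating between spectrograms and looping them so that consecutive clips flow into one another.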
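Interpolation is easiest to see in code. Below is a minimal Python sketch of spherical linear interpolation (slerp), a technique commonly used with Stable Diffusion to morph smoothly between two random-noise seeds. We assume, without claiming it is Riffusion's exact code, that blending between clips works along these lines, with slerp for the seed noise and plain linear interpolation (`lerp`) for the prompt embeddings. The function names and the `dot_threshold` fallback are illustrative choices.

```python
import math
import random


def lerp(t, v0, v1):
    """Linear interpolation between two equal-length vectors."""
    return [(1 - t) * a + t * b for a, b in zip(v0, v1)]


def slerp(t, v0, v1, dot_threshold=0.9995):
    """Spherical linear interpolation between two equal-length vectors.

    Walks along the arc between v0 and v1 instead of the straight line, so
    the result keeps roughly the same length as the inputs. Diffusion models
    expect noise of a particular magnitude, and a straight-line mix of two
    independent noise vectors is noticeably shorter halfway through.
    """
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    if abs(cos_theta) > dot_threshold:
        # Nearly parallel vectors: the arc is almost a line, so lerp is safe.
        return lerp(t, v0, v1)
    theta = math.acos(cos_theta)
    w0 = math.sin((1 - t) * theta) / math.sin(theta)
    w1 = math.sin(t * theta) / math.sin(theta)
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


if __name__ == "__main__":
    # Two hypothetical seed-noise vectors (real latents are much larger tensors).
    rng = random.Random(0)
    seed_a = [rng.gauss(0, 1) for _ in range(4096)]
    seed_b = [rng.gauss(0, 1) for _ in range(4096)]
    for step in range(5):
        t = step / 4
        print(f"t={t:.2f}  slerp norm={norm(slerp(t, seed_a, seed_b)):.1f}  "
              f"lerp norm={norm(lerp(t, seed_a, seed_b)):.1f}")
```

Running it shows the slerp norm staying roughly constant across the sweep while the lerp norm dips in the middle, which is why slerp tends to give cleaner in-between results.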

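The real magic happens when these generated images are turned back into sound. A spectrogram is built with the short-time Fourier transform (STFT): the audio is sliced into short, overlapping windows, and the frequency content of each window becomes one column of the image. The image stores only how loud each frequency is, not its phase, so the original waveform cannot be recovered exactly. Instead, Riffusion uses Torchaudio to estimate the missing phase (with the Griffin-Lim algorithm) and invert the STFT, producing an audio waveform that can be played back as music.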
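The short-time Fourier transform itself is simple enough to sketch in plain Python. The illustrative functions below (`hann`, `stft`, `magnitudes`) use a naive DFT rather than an FFT, and the frame size and hop length are arbitrary; real pipelines, including Torchaudio's, use optimized implementations. The example also shows why the phase has to be estimated: a sine and a cosine at the same frequency produce identical magnitude spectrograms, so an image that stores only magnitudes cannot tell them apart.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n, tapering each frame's edges to zero."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft(signal, n_fft=64, hop=16):
    """Short-time Fourier transform of a real signal.

    Slides a Hann-windowed frame of n_fft samples along the signal in steps of
    `hop` samples and takes a DFT of each frame. Returns one list per frame
    containing n_fft // 2 + 1 complex bins (0 Hz up to the Nyquist frequency).
    A naive O(n_fft^2) DFT is used for clarity; real code would use an FFT.
    """
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def magnitudes(frames):
    """Keep only |X|, the loudness of each bin; the phase (angle of X) is dropped."""
    return [[abs(x) for x in frame] for frame in frames]


if __name__ == "__main__":
    # A tone that completes 8 cycles every 64 samples, so it sits exactly on bin 8.
    sine = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
    cosine = [math.cos(2 * math.pi * 8 * n / 64) for n in range(256)]
    mag_sine = magnitudes(stft(sine))
    mag_cosine = magnitudes(stft(cosine))
    print(len(mag_sine), "frames x", len(mag_sine[0]), "bins")  # 13 frames x 33 bins
    print("loudest bin:", max(range(33), key=mag_sine[0].__getitem__))  # 8
    diff = max(abs(a - b) for fa, fb in zip(mag_sine, mag_cosine) for a, b in zip(fa, fb))
    print("largest sine/cosine magnitude difference:", diff)  # tiny: the phase is gone
```

Going the other way requires guessing the discarded phase. Torchaudio offers this as `torchaudio.transforms.GriffinLim`, which iteratively refines a phase estimate until the inverse STFT is consistent with the given magnitudes.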

Riffusion is truly fascinating from a technological standpoint, although explaining in full detail how it works takes a lot of technical jargon.

For those of you who want to venture into the technical side, feel free to follow any of the links provided above.

Or better yet, take the tour of the technology that the creators themselves provide.


TL;DR: Riffusion is a version of Stable Diffusion that turns its text-to-image capabilities into text-to-music capabilities.

How do you use Riffusion?

If you're familiar with AI prompt-based software, you already know how to use Riffusion.

And even if you aren't, you'll probably know how to use it after just a couple of minutes of trying it out.

As with any other prompt-based AI tool, all you really have to do is open Riffusion's website, type in your idea, and let the AI work its magic.

You will be presented with a simple, straightforward one-page interface where you can start experimenting.

To simplify things even further, here is a list of everything you can do on Riffusion's website:

Riffusion Interface Explained
Riffusion Settings Interface

Riffusion Interface

  1. Prompt Box: This is where you write your prompts, and it is probably 90% of what you'll ever need in Riffusion. It's that simple.

  2. Random Prompt: If you don't feel like coming up with a prompt, just click the Random Prompt button and see what it comes up with.

  3. Play Button: It's a play button, pretty straightforward.

  4. Share Button: You can share your creation with a simple link.

  5. Settings: Here you can play with a couple of extra settings if you want to explore its functionality a little deeper.

  6. Debug: A tool for programmers who want to help find and fix bugs in the code.

  7. Seed Image: Here you can change the seed spectrogram that the software uses as its starting reference.

  8. Denoising: A parameter you can experiment with to let the algorithm take more or fewer "creative liberties": higher values let the result stray further from the seed image.
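The Denoising setting corresponds to what img2img pipelines usually call strength. The Python sketch below shows the two things that value typically controls: how many sampling steps actually run, and how much of the seed spectrogram survives the noise added at the start. It assumes Stable Diffusion v1's default "scaled linear" noise schedule (betas from 0.00085 to 0.012 over 1,000 training steps) and a roughly proportional mapping from strength to starting timestep, an approximation that varies by scheduler. The function names are hypothetical, and Riffusion's app may map its slider differently.

```python
import math

NUM_TRAIN_TIMESTEPS = 1000
BETA_START, BETA_END = 0.00085, 0.012  # assumed Stable Diffusion v1 defaults


def _alphas_cumprod():
    """alpha_bar_t = product of (1 - beta_s) for s <= t, with scaled-linear betas."""
    lo, hi = math.sqrt(BETA_START), math.sqrt(BETA_END)
    values, prod = [], 1.0
    for i in range(NUM_TRAIN_TIMESTEPS):
        beta = (lo + (hi - lo) * i / (NUM_TRAIN_TIMESTEPS - 1)) ** 2
        prod *= 1.0 - beta
        values.append(prod)
    return values


ALPHA_BAR = _alphas_cumprod()


def denoising_steps(strength, num_inference_steps=50):
    """How many of the sampler's steps actually run for a given strength.

    Strength 0 runs no steps (the seed comes back unchanged); strength 1 runs
    all of them, which is close to generating from pure noise.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


def seed_weight(strength):
    """Share of the seed image's latent kept after the initial noising.

    The starting latent is sqrt(alpha_bar_t) * seed + sqrt(1 - alpha_bar_t) * noise,
    with t roughly proportional to strength; this returns sqrt(alpha_bar_t).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    if strength == 0.0:
        return 1.0  # no noise is added at all
    t = max(int(strength * NUM_TRAIN_TIMESTEPS) - 1, 0)
    return math.sqrt(ALPHA_BAR[t])


if __name__ == "__main__":
    for s in (0.25, 0.5, 0.75, 1.0):
        print(f"strength={s:.2f}  steps run={denoising_steps(s)}/50  "
              f"seed weight={seed_weight(s):.3f}")
```

The higher the value, the more noise buries the seed and the more steps the model spends "re-imagining" it, which is what taking more creative liberties means in practice.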


While the results are not quite there yet, Riffusion is an exciting new technology with the potential to revolutionize the music industry, just as Stable Diffusion and Midjourney are revolutionizing the world of visual art.

With its ability to generate music from images of sound, respond to text prompts, and interpolate between different clips, Riffusion opens up new possibilities for music creation and could shorten the music production process.

If you're looking for a new and innovative way to create music, Riffusion is definitely worth exploring.

