NVIDIA Teases “World’s Most Flexible Sound Machine”, Fugatto

Semiconductor manufacturer NVIDIA – currently the world’s most valuable company – today shared a preview of Fugatto, an AI-powered audio tool that it describes as “the World’s Most Flexible Sound Machine”.

Fugatto is intended to be a sort of Swiss Army Knife for audio, letting you generate or transform any mix of music, voices and sounds using just text prompts.

“Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale,” says composer and NVIDIA researcher Rafael Valle.

Here’s the official teaser video:

Like earlier generative audio demos, many of the audio examples in the promo seem primitive. On the other hand, this is the first generative AI demo we’ve seen that also showcases the tool being used in interesting creative ways.

For example, the video demonstrates how you can use text prompts with Fugatto to extract vocals from a mix, morph one sound into another, generate realistic speech, remix existing audio, and convert MIDI melodies into realistic vocal samples. These are capabilities that could actually complement and extend the current generation of digital audio workstations.
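Fugatto itself isn’t publicly available, but some of these individual capabilities already exist in open-source tools, which gives a sense of what this kind of workflow feels like. As a rough illustration of the “extract vocals from a mix” idea, here’s a minimal sketch using the open-source Demucs stem separator; it assumes the demucs.api interface from Demucs 4, and the file names are placeholders:

    import demucs.api  # pip install demucs

    # Load a pretrained hybrid-transformer separation model.
    separator = demucs.api.Separator(model="htdemucs")

    # Split a mix into four stems: drums, bass, other, vocals.
    origin, separated = separator.separate_audio_file("mix.wav")

    # Keep just the vocal stem.
    demucs.api.save_audio(separated["vocals"], "vocals.wav",
                          samplerate=separator.samplerate)

Fugatto’s pitch is that tasks like this – plus sound morphing, speech synthesis and more – collapse into a single model driven by text prompts, rather than a separate specialized tool per task.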

Here’s what they have to say about the technology behind Fugatto:

“Fugatto is a foundational generative transformer model that builds on the team’s prior work in areas such as speech modeling, audio vocoding and audio understanding.

The full version uses 2.5 billion parameters and was trained on a bank of NVIDIA DGX systems packing 32 NVIDIA H100 Tensor Core GPUs.

Fugatto was made by a diverse group of people from around the world, including India, Brazil, China, Jordan and South Korea. Their collaboration made Fugatto’s multi-accent and multilingual capabilities stronger.

One of the hardest parts of the effort was generating a blended dataset that contains millions of audio samples used for training. The team employed a multifaceted strategy to generate data and instructions that considerably expanded the range of tasks the model could perform, while achieving more accurate performance and enabling new tasks without requiring additional data.

They also scrutinized existing datasets to reveal new relationships among the data. The overall work spanned more than a year.”
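Fugatto has no public release or API to try, but prompt-driven audio generation can already be sampled with open models. Here’s a minimal sketch using MusicGen via the Hugging Face transformers text-to-audio pipeline; the prompt and output file name are placeholders:

    import scipy.io.wavfile
    from transformers import pipeline  # pip install transformers scipy

    # Load an open text-to-audio model (MusicGen small).
    synth = pipeline("text-to-audio", model="facebook/musicgen-small")

    # Generate audio from a plain-language prompt.
    result = synth("an industrial machine groaning in a vast hall",
                   forward_params={"do_sample": True})

    # The pipeline returns a dict with the waveform and its sample rate.
    scipy.io.wavfile.write("output.wav",
                           rate=result["sampling_rate"],
                           data=result["audio"])

MusicGen is a far smaller and narrower model than what NVIDIA describes, but the interaction – type a description, get audio back – is the same one Fugatto’s demo is built around.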

We’ve got a lot of questions about Fugatto, ranging from “When will Fugatto become a real thing?” to “Will the data centers needed to power Fugatto generate enough heat to bring ocean-front views to the Midwest?”

But, with this demo, we can see a paradigm shift coming in how musicians work with audio – one where text-based and spoken commands become an important part of musicians’ toolkits.

This is a step towards a future Synthtopia predicted back in 2010, in 10 Predictions For Electronic Music Making In The Next Decade:

“Music software will get smarter – the state of the art in digital audio workstations is amazing. But, by and large, DAW manufacturers are still making virtual versions of traditional hardware studios. Most soft synths still look and act like their hardware predecessors, and that’s what buyers are demanding.

At this point, imitating traditional studios is horseless carriage thinking – letting what we can imagine be defined by the past. In the next decade, music software is going to get smarter and interfaces will make bolder leaps. You’ll tell your computer that you want to make a drum and bass track and your DAW will anticipate the way you’ll want your virtual studio configured. Ready to get started? Say “gimme a beat!” You’ll interact with your DAW to “evolve” new sounds. You’ll hum the bassline and your DAW will notate it. You’ll build the track by saying that you want a 32-measure intro and a drop down to the bass and then bring the kick back in after 16 measures. You’ll draw a curve on a timeline to define the shape of your track, do a run-through and improvise over the rhythm track.

Then you’ll tell your DAW to add a middle eight and double the bassline and to master it with more “zazz” and it will be saved in the cloud for your fans to listen to.”

We may have been optimistic on that timetable. But it’s clear that – at least for younger musicians – we’re heading for an era where the ‘virtual studio’ paradigm of current DAWs may no longer be relevant. For someone new to music production, being able to remix audio and arrange music using voice commands will make it much easier to get started.

And, for those of us who have invested years in developing skills with audio software – it’s clear that new audio tools are coming, very quickly, that promise to let us work with audio in new ways. It seems inevitable that some of the capabilities demonstrated in this video will be integrated into the next generation of digital audio workstations.

Is generative audio about to get interesting for creative musicians? Or are you sick of hearing about how AI is going to get awesome? Share your thoughts in the comments!

36 thoughts on “NVIDIA Teases “World’s Most Flexible Sound Machine”, Fugatto”

  1. “Fugatto was made by a diverse group of people from around the world, including India, Brazil, China, Jordan and South Korea.” So was the atomic bomb. Do you want one in your living room? I love marketing blurbs. They apply to everything and nothing simultaneously.

  2. The “converting a melody to vocal samples” ability is pretty much sonically endless. And being able to use your own samples/loops as starting points in conjunction with text descriptors is pretty powerful as well. I would like to see/hear if it could generate FX as well. Could you actually type specific FX settings into your text descriptors? Like a reverb type, e.g. “Grand Canyon with 60% wet, and 2 second ping pong delay,” etc. Could it create convolutions, either with text descriptors or even actual numeric settings?

    Based on the teaser I would have tons of uses for this, and the melody to vocal ability has me intrigued…just using the vast classical repertoire would be a blast to play with, JMHO but this has quite a few rabbit holes I could enjoy exploring.

    1. I agree in principle, but the generated vocal didn’t actually follow the MIDI melody, so I’m not sure how useful it would be yet. Might still be usable in some applications.

  3. The MIDI bit doesn’t actually copy the MIDI at all, after the first couple of notes. I can see this would be cool for never-heard-before sound effects, though.

  4. This is where things are going…it looks like they’re here already? Any perceived lack of ‘quality’ on the part of the consumer will be remedied within a few years or less, just as the silly AI-generated images of hands with more than 5 fingers and smiles with impossible amounts of teeth are being remedied as well. Just as the great American composer Charles Ives had a career in insurance, and Robert Fripp of King Crimson has a healthy income in real estate holdings, the musician/sound designer/synthesist of the future is well advised to have a ‘Plan B’ (‘not’ the abortion pill, or possibly?) in their hip pocket; you have the freedom to do anything you like in the arts, but the commercial arts & entertainment world isn’t going to pay you for work that can be done by an algorithm and a crafty writer of AI prompts…sucks, doesn’t it? To accept the fact that the end consumer doesn’t give a rat’s butt about whether a Ben Burtt-level sound designer crafted a groovy sound effect in their video game or film/tv thing. We’re a niche, and we just became obsolete…the predictably dull result of evolution vs. creation, right? Woo-hoo…

    1. Don’t forget that Bob Moog had a little sideline called Robert A. Moog Automobile Components, mainly steering, drive train, shocks and brakes, which is a going concern to this day. Yamaha can tell you a thing or two about diversification.

    1. Food stamps can’t be used to buy animal food. I had to buy canned chicken and salmon, so my cats ate like kings while I ate ramen in a cup. This is American justice.

  5. This marketing is for normies excited about the potential displacement of entire industries by this killer app. I couldn’t stake a workflow on this as-is; I’d spend more time fighting it to sound good than learning how to make things sound good or applying what I know.

    1. If you have something good to start with, getting good use out of something like this is just a bonus. Seems like a great program to run ideas through for music. Possibly a powerful tool for things like foley on a budget. And, since there’s no DAW that I can see, we can totally check out your new dawless jam.

  6. Chamwow for the win! It’s all about the workflow. I understand the WHAT, but the WHY is still hazy, as with any new technology. I’m kind of piano-centric. I love my DAW, but for me, it’s mainly about knowing what to grab and when. Talking to my rig is still a distant second (or third or fourth) to simply laying hands to things.

    I’m not putting it down. I can see some merits in it. It’s just not refined enough for me to alter my hard-won paradigm this early in. Making music shouldn’t be too easy. It’s more about putting the right kinds of sweat into your work.

  7. Though I can absolutely imagine this kind of tech being used in a creative way by a creative person, it seems to me like most of the demos in this space are more interested in supplanting/bypassing human creative processes than enhancing them.
    There are def some practical/technical things for which tools like this are helpful (removing vocals, splitting to stems, etc.). But mostly I get the feeling that the intent behind creating such tools is not actually coming from an inspired place, in the sense of ‘furthering human creativity’. Probably the process of developing it is a creative effort, within the realm of software development, but those are two different drives.

  8. Who would be interested in this and why?
    Who: recording studios, producers, and music executives.
    Why: cost reduction.
    As with everything AI-inspired, follow the money: AI could help discover cures for cancer but fake cat videos make more money.
    It’s not about what the tech does, but about who controls it and what they want you to use it for.

    Oh, and: who owns the rights?

  9. It’s like browsing an infinite sample library where you have to type a long sample description and it will blurt out something you never wanted.

  10. I’m super happy this is happening in the audio space, and in video, programming, image and writing too.
    Why?
    Because now everyone can feel guilt-free diluting the industry with good-enough content. The real magic is going to be when they let prisons have access to the tools and the ability to earn money by letting AI do all the work. Send a kite with a prompt. Who needs a studio? If a non-artist wants to make art, tools shouldn’t matter. Life is an illusion anyway… so good-enough quality is better than nothing.

  11. Love it. Is it available already?
    PS: Lots of negative comments here. Why? If you don’t like it, you don’t have to use it.
    On the other hand, it can be a great tool for “writer’s block” moments 😉
    Like it or not, AI is the future and is here to stay…

  12. Pros:
    1. Solve some previously difficult problems, like isolating tracks, etc.
    2. Carefully write prompts and see if the AI can interpret your intent.
    3. Get good starting points for creative work.
    4. Take all the shit jobs composers don’t want to do like pimple cream commercials, so that composers can focus on making good art.

    Cons:
    1. Deep fakes will destroy everything.
    2. Energy consumption from crypto and AI computation farms will have massive carbon footprint at a time when we need to be doing the opposite.
    3. AI will affect some employment opportunities for musicians, composers, voice-over talent, actors, writers, directors, animators, etc. etc. etc.
    4. People’s focus will shift away from the fun & gratifying processes to joyless prompt writing.

    The irony is that as the tech gets more powerful, a person’s ability to write clear and well-communicated ideas for prompts becomes more critical. English teachers everywhere will feel vindicated.

    As for me, I’ll keep doing what I do. Eventually, an AI will write prompts for other AIs to do various things. I’ll be outside looking at stuff.

  13. I mentioned this in a bit of a jumbled way before, so maybe a second attempt is warranted.
    One question: Is it likely that tools of this nature will help to enrich human creative potential? (enrich? optimize, imo)

    I’m not vehemently for or against any type of creative tool, but it does seem to me that the creators of these sort of tools are not typically inspired by that as a primary goal.

    Considering the motivational differences between creating content and creating art, is this sort of thing pushing toward (and maybe celebrating) a reality where the former takes the place of the latter? Like more of the “creative people need to make more shit, more quickly” mindset that tools like Daniel Ek have vomited into the collective consciousness.

  14. Bottom line: LOTS of time and effort is invested in producing crap for media consumers who aren’t so discerning. Plenty of “throw away” music, films and books have been put out and consumed by folks who couldn’t have cared less how they were produced. Why NOT use AI to save time and effort producing “throw away” content for the non-discerning masses?

    That being said, I’m getting really tired of wading through an ever-increasing flood of fully AI-produced crap (shorts, music, etc) on YouTube and other media sites.

  15. Many of these kinds of things are already available or being developed, and fortunately many of them are open source, or at least the models are.
    UVR is a very good stem maker/track isolator and is free.
    Suno and Udio can create entire songs/lyrics from a text prompt as well as take a piece of your music and output it differently based on a text prompt.
