Pushing the frontiers of audio generation

Published: 30 October 2024

Authors: Zalán Borsos, Matt Sharifi and Marco Tagliasacchi

Illuminating speech patterns, iterative advances in dialogue generation, and a relaxed conversation between two voices.

Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Speech is central to human connection. It helps people around the world exchange information and ideas, express emotions and create mutual understanding. As our technology for generating natural, dynamic voices continues to improve, we're unlocking richer, more engaging digital experiences.

Over the past few years, we've been pushing the boundaries of audio generation, developing models that can create high-quality, natural speech from a range of inputs, such as text, tempo controls and particular voices. This technology powers the single-voice audio in many Google products and experiments, including Gemini Live, Project Astra, Journey Voices and YouTube auto dubbing, and is helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Working with partners across Google, we recently helped develop two new features that can generate long-form, multi-speaker dialogue to make complex content more accessible:

  • NotebookLM Audio Overviews turns uploaded documents into an engaging, lively dialogue. With one click, two AI hosts summarize the user's material, make connections between topics and banter back and forth.
  • Illuminate creates formal, AI-generated discussions about research papers to help make knowledge more accessible and digestible.

Here, we provide an overview of our latest speech generation research that underpins all of these experimental products and tools.

Pioneering techniques for audio generation

For years, we've been investing in audio generation research and exploring new ways of generating more natural dialogue in our products and experimental tools. In our previous research on SoundStorm, we first demonstrated the ability to generate 30 seconds of natural dialogue between multiple speakers.

This built on our earlier work with SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.

SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input, without compromising its quality. As part of its training, SoundStream learns how to map audio to a range of acoustic tokens. These tokens capture all of the information needed to reconstruct the audio with high fidelity, including properties such as prosody and timbre.
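To make the idea of acoustic tokens concrete, here's a minimal sketch of residual vector quantization, the general technique neural codecs use to turn each audio frame into a small group of discrete tokens. Every shape and codebook size below is an illustrative assumption, not SoundStream's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_LEVELS = 4          # quantizers per frame (assumed; coarse -> fine)
CODEBOOK_SIZE = 1024  # entries per codebook (assumed)
DIM = 64              # embedding dimension of one audio frame (assumed)

# One random codebook per quantizer level; a real codec learns these.
codebooks = rng.normal(size=(N_LEVELS, CODEBOOK_SIZE, DIM))

def rvq_encode(frame):
    """Quantize one frame embedding into N_LEVELS token ids.

    Each level quantizes the residual left over by the previous levels,
    so early tokens carry coarse information and later tokens refine it.
    """
    residual = frame.copy()
    tokens = []
    for level in range(N_LEVELS):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        tokens.append(idx)
        residual = residual - codebooks[level][idx]
    return tokens

def rvq_decode(tokens):
    """Reconstruct the frame embedding by summing the chosen codewords."""
    return sum(codebooks[level][idx] for level, idx in enumerate(tokens))

frame = rng.normal(size=DIM)   # stand-in for an encoded audio frame
tokens = rvq_encode(frame)
approx = rvq_decode(tokens)
print(tokens, np.linalg.norm(frame - approx))  # reconstruction error shrinks with more levels
```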

AudioLM treats audio generation as a language modeling task, producing the acoustic tokens of codecs like SoundStream. As a result, the AudioLM framework makes no assumptions about the type or makeup of the audio being generated, and can flexibly handle a variety of sounds without needing architectural adjustments, which makes it a good candidate for modeling multi-speaker dialogues.
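This framing is easiest to see in code: generation becomes next-token prediction over acoustic tokens, the same loop a text language model runs over words. The sketch below uses a toy bigram model for brevity; AudioLM itself uses large Transformers, and the vocabulary size here is an assumption.

```python
import numpy as np

VOCAB = 1024  # acoustic token vocabulary size (assumed)
rng = np.random.default_rng(1)

# Toy "training data": sequences of acoustic tokens from a codec.
sequences = [rng.integers(0, VOCAB, size=100) for _ in range(32)]

# Fit bigram counts: P(next token | current token), with Laplace smoothing.
counts = np.ones((VOCAB, VOCAB))
for seq in sequences:
    for cur, nxt in zip(seq[:-1], seq[1:]):
        counts[cur, nxt] += 1
probs = counts / counts.sum(axis=1, keepdims=True)

def generate(prompt, n_tokens):
    """Autoregressively sample acoustic tokens, one at a time."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(int(rng.choice(VOCAB, p=probs[seq[-1]])))
    return seq

tokens = generate(prompt=[0], n_tokens=20)
print(tokens)  # a codec would decode these back into audio
```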

An example of a multi-speaker dialogue generated by NotebookLM Audio Overview, based on a few potato-related documents.

Building on this research, our latest speech generation technology can produce 2 minutes of dialogue, with improved naturalness, speaker consistency and acoustic quality, when given a script of dialogue and speaker turn markers. The model also performs this task in under 3 seconds on a single Tensor Processing Unit (TPU) v5e chip, in one inference pass. This means it generates audio more than 40 times faster than real time.
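As a quick check of that claim, 120 seconds of audio produced in at most 3 seconds of compute gives a real-time factor of at least 40:

```python
audio_seconds = 120     # 2 minutes of generated dialogue
compute_seconds = 3     # upper bound on single-pass TPU v5e inference time
print(audio_seconds / compute_seconds)  # 40.0, i.e. more than 40x real time
```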

Scaling our audio generation models

Scaling our single-speaker generation models to multiple speakers then became a matter of data and model capacity. To help our latest speech generation model produce longer speech segments, we created an even more efficient speech codec that compresses audio into a sequence of tokens at rates as low as 600 bits per second, without compromising the quality of its output.
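For a sense of what 600 bits per second means in tokens, here's a back-of-the-envelope sketch. The frame rate, tokens per frame and codebook size below are illustrative assumptions chosen only so that the numbers multiply out to 600; they are not the codec's actual parameters.

```python
import math

FRAME_RATE_HZ = 25      # codec frames per second (assumed)
TOKENS_PER_FRAME = 2    # quantizer levels per frame (assumed)
CODEBOOK_SIZE = 4096    # entries per codebook (assumed)

bits_per_token = math.log2(CODEBOOK_SIZE)                 # 12 bits
bitrate = FRAME_RATE_HZ * TOKENS_PER_FRAME * bits_per_token
print(bitrate)  # 600.0 bits per second
```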

The tokens produced by our codec have a hierarchical structure and are grouped by time frames. The first tokens within a group capture phonetic and prosodic information, while the last tokens encode fine acoustic details.
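One way to picture this layout is as a sequence of frame-sized groups, each ordered coarse to fine, flattened into one long token stream for the model. The structure below is a hypothetical illustration, not the codec's real format.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """Tokens for one time frame, ordered coarse -> fine."""
    coarse: int          # earliest token: phonetic/prosodic content
    fine: list[int]      # later tokens: progressively finer acoustic detail

# A 3-frame utterance; the token ids are made up for illustration.
utterance = [
    Frame(coarse=417, fine=[12, 903, 55]),
    Frame(coarse=88,  fine=[640, 7, 311]),
    Frame(coarse=951, fine=[230, 498, 76]),
]

# Flattened into the sequence a model would generate, frame by frame.
flat = [t for f in utterance for t in [f.coarse, *f.fine]]
print(flat)
```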

Even with our new speech codec, producing a 2-minute dialogue requires generating over 5,000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.
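One standard way to exploit such structure is to pair a large temporal model that advances once per frame with a small depth model that predicts the handful of tokens inside each frame. The PyTorch sketch below shows this general factorization; the dimensions, layer counts and two-module split are assumptions about the technique, not our model's actual architecture, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

TOKENS_PER_FRAME = 4    # tokens in one hierarchical group (assumed)
VOCAB, DIM = 1024, 256  # token vocabulary and model width (assumed)

class HierarchicalLM(nn.Module):
    """Big Transformer over frames, small Transformer within each frame.

    Running the large model once per frame instead of once per token
    divides its sequence length by TOKENS_PER_FRAME, which is what makes
    sequences of 5,000+ tokens tractable.
    """
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.temporal = nn.TransformerEncoder(  # one step per frame
            nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True),
            num_layers=4)
        self.depth = nn.TransformerEncoder(     # one step per token in a frame
            nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        # tokens: (batch, n_frames, TOKENS_PER_FRAME) integer ids
        b, f, k = tokens.shape
        frame_emb = self.embed(tokens).mean(dim=2)       # (b, f, DIM)
        ctx = self.temporal(frame_emb)                   # frame-level context
        # Condition each frame's token stack on its frame context.
        tok_emb = self.embed(tokens) + ctx.unsqueeze(2)  # (b, f, k, DIM)
        out = self.depth(tok_emb.reshape(b * f, k, DIM))
        return self.head(out).reshape(b, f, k, VOCAB)    # next-token logits

model = HierarchicalLM()
dummy = torch.randint(0, VOCAB, (2, 10, TOKENS_PER_FRAME))
print(model(dummy).shape)  # torch.Size([2, 10, 4, 1024])
```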

With this technique, we can efficiently generate acoustic tokens that correspond to the dialogue, within a single autoregressive inference pass. Once generated, these tokens can be decoded back into an audio waveform using our speech codec.
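Conceptually, then, inference is a two-stage pipeline: generate tokens, then decode them. The stand-in functions and speaker-marker format below are hypothetical placeholders for the model and codec described above, not a real API.

```python
def generate_tokens(script):
    """Stand-in for the autoregressive model: script lines -> token ids."""
    return [(sum(map(ord, line)) + i) % 1024 for line in script for i in range(4)]

def decode_tokens(tokens):
    """Stand-in for the codec decoder: token ids -> waveform samples."""
    return [t / 1024.0 for t in tokens]

# Hypothetical speaker-turn markers; the real markup format isn't shown here.
script = ["<speaker:1> Hi there!", "<speaker:2> Hey! How are you doing?"]
waveform = decode_tokens(generate_tokens(script))  # one pass, then decode
print(len(waveform))
```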

Animation showing how our speech generation model autoregressively produces a stream of audio tokens, which are then decoded back into a waveform consisting of a two-speaker dialogue.

To teach our model how to generate realistic exchanges between multiple speakers, we pretrained it on hundreds of thousands of hours of speech data. Then we fine-tuned it on a much smaller dataset of dialogue with high acoustic quality and precise speaker annotations, consisting of unscripted conversations from a number of voice actors, complete with the realistic disfluencies, the "umm"s and "aah"s, of real conversation. This step taught the model how to reliably switch between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone and timing.

In line with our AI Principles and our commitment to developing and deploying AI technologies responsibly, we're embedding our SynthID technology to watermark the AI-generated audio content from these models, to help safeguard against potential misuse of this technology.

New speech experiences ahead

We're now focused on improving our model's fluency and acoustic quality, and on adding more fine-grained controls for features like prosody, while exploring how best to combine these advances with other modalities, such as video.

The potential applications for advanced speech generation are vast, especially when combined with our Gemini family of models. From enhancing learning experiences to making content more universally accessible, we're excited to keep pushing the boundaries of what's possible with speech-based technologies.

Acknowledgements

The authors of this work: Zalán Borsos, Matt Sharifi, Brian McWilliams, Yunpeng Li, Damien Vincent, Félix de Chaumont Quitry, Martin Sundermeyer, Eugene Kharitonov, Alex Tudor, Victor Ungureanu, Karolis Misiunas and Marco Tagliasacchi.

We thank Leland Rechis, Ralph Leith, Paul Middleton, Poly Pata, Minh Truong and RJ Skerry-Ryan for their critical efforts on dialogue data.

We're also grateful to our collaborators across Labs, Illuminate, Cloud, Speech and YouTube for their outstanding work bringing these models into products.

We also thank Françoise Beaufays, Krishna Bharat, Tom Hume, Simon Tokumine and James Zhao for their guidance on the project.

