A new, open source text-to-speech model called Dia has arrived to challenge ElevenLabs, OpenAI and more

Nari Labs, an emerging two-person startup, has released Dia, a 1.6-billion-parameter text-to-speech (TTS) model designed to produce naturalistic dialogue directly from text prompts, and one of its creators claims it exceeds the performance of competing proprietary offerings such as Google's NotebookLM AI podcast feature.
It could also threaten uptake of OpenAI's gpt-4o-mini-tts.
"Dia rivals NotebookLM's podcast feature while surpassing ElevenLabs Studio and Sesame's open model in quality," said Toby Kim, one of the co-creators of Nari Labs and Dia.
In a separate post, Kim noted that the model was built with "zero funding," adding: "...we were not AI experts from the beginning. It all started when we fell in love with NotebookLM's podcast feature when it was released last year. We wanted more control over the voices..."
Kim credited Google for giving him and his collaborator access to its Tensor Processing Unit (TPU) chips to train Dia through the Google TPU Research Cloud.
Dia's code and weights, the model's internal settings, are now available for download and local deployment by anyone via Hugging Face or GitHub. Individual users can try generating speech with it on a Hugging Face Space.
Advanced controls and more customizable features
Dia supports nuanced features such as emotional tone, speaker tagging, and nonverbal vocal cues, all from plain text.
Users can mark speaker turns with tags like [S1] and [S2], and include cues such as (laughs), (coughs), or (clears throat) to enrich the resulting dialogue with nonverbal behaviors.
These tags are interpreted correctly by Dia during generation, something not reliably supported by other available models, according to the company's examples page.
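For instance, a Dia script combining the tags described above might look like the following (the dialogue itself is illustrative, not taken from Nari's examples):

```
[S1] Did you hear the new demo? (laughs)
[S2] I did. (clears throat) It sounded surprisingly human.
[S1] Even the coughs landed. (coughs)
```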
The model is currently English-only and is not tied to any single speaker's voice; it produces different voices on each run unless users fix the generation seed or supply an audio prompt. Audio conditioning, or voice cloning, lets users guide speech tone and vocal likeness by uploading a sample clip.
Nari Labs provides example code to facilitate this process and a Gradio-based demo so users can try it without any setup.
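For developers skipping the hosted demo, loading the model locally follows the usage pattern published in the Nari Labs repository. The snippet below is a minimal sketch based on that README; the model ID "nari-labs/Dia-1.6B" matches the Hugging Face page, but exact function signatures may have changed since release:

```python
import soundfile as sf  # pip install soundfile
from dia.model import Dia  # package from the Nari Labs GitHub repository

# Download the weights from Hugging Face and load them
# (a CUDA GPU with roughly 10 GB of VRAM is required).
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Speaker tags and nonverbal cues go directly in the script text.
script = "[S1] Dia generates dialogue straight from text. (laughs) [S2] And the tags actually work."

# Generate a waveform and write it to disk at 44.1 kHz.
audio = model.generate(script)
sf.write("dialogue.wav", audio, 44100)

# Voice cloning: the repo also documents conditioning generation on a sample
# clip (an "audio prompt"); the parameter name below is illustrative, not verified.
# audio = model.generate(script, audio_prompt_path="sample_clip.wav")
```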
Comparisons to ElevenLabs and Sesame
Nari offers a range of audio samples generated by Dia on its website, comparing it to its leading text-to-speech rivals, specifically ElevenLabs Studio and Sesame CSM-1B, the latter a new text-to-speech model from Oculus VR co-founder Brendan Iribe that went viral earlier this year.
In side-by-side examples, Nari Labs shows how Dia outperforms the competition in several areas:
In standard scenarios, Dia handles both natural timing and nonverbal expressions better. For example, in a script ending with (laughs), Dia interprets the tag and produces actual laughter, while ElevenLabs and Sesame output textual substitutions such as "haha."
(Nari's site includes audio samples pairing Dia's rendition of a sentence with the same sentence spoken by ElevenLabs Studio.)
In multi-turn conversations with emotional range, Dia demonstrates smoother transitions and tone shifts. One test featured a dramatic, emotionally charged scene; Dia rendered the urgency and stress effectively, while competing models often flattened the delivery or lost pacing.
Dia uniquely handles nonverbal scripts, such as a comedic exchange involving coughing, sniffing, and laughing. Competing models failed to recognize these tags or skipped them.
Even with rhythmically complex content like rap lyrics, Dia generates fluid, performance-like speech that maintains tempo. This contrasts with the more monotone or fragmented outputs from ElevenLabs and Sesame's 1B model.
Using audio prompts, Dia can extend or continue a speaker's vocal style into new lines. An example using a conversational clip as a seed showed how Dia carried the sample's vocal characteristics through the rest of the scripted dialogue. This feature is not robustly supported in other models.
In one set of tests, Nari Labs noted that Sesame's best website demo likely used an internal version of the model rather than the public 1B checkpoint, resulting in a gap between advertised and actual performance.
Model access and technical specifications
Developers can access Dia from the Nari Labs GitHub repository and its Hugging Face model page.
The model runs on PyTorch 2.0+ with CUDA 12.6 and requires about 10 GB of VRAM.
Inference on enterprise GPUs such as the NVIDIA A4000 delivers roughly 40 tokens per second.
While the current release runs only on GPU, Nari plans to add CPU support and a quantized version to improve accessibility.
The startup offers both a Python library and a CLI tool for broader deployment.
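As a quick sanity check before installing, the short sketch below, which uses only standard PyTorch calls, verifies that a runtime meets the rough requirements reported above:

```python
import torch

# Dia's current release is GPU-only; verify CUDA is available.
if not torch.cuda.is_available():
    raise RuntimeError("No CUDA GPU found; CPU support is planned but not yet released.")

# Check the installed PyTorch major version (2.0+ is required).
major = int(torch.__version__.split(".")[0])
assert major >= 2, f"PyTorch 2.0+ required, found {torch.__version__}"

# Report total VRAM against the ~10 GB guideline.
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
if vram_gb < 10:
    print("Warning: below ~10 GB of VRAM, generation may run out of memory.")
```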
Dia's flexibility opens up use cases ranging from content creation to assistive technologies and synthetic voiceovers.
Nari Labs is also developing a consumer version of Dia aimed at casual users who want to remix or share generated conversations. Interested users can join an email waitlist for early access.
Fully open source
The model is distributed under the fully open source Apache 2.0 license, meaning it can be used for commercial purposes, which will clearly appeal to enterprises and indie app developers.
Nari Labs explicitly prohibits uses that involve impersonating individuals, spreading misinformation, or engaging in illegal activity. The team encourages responsible experimentation and has taken a stance against unethical deployment.
Dia's development was supported by the Google TPU Research Cloud, Hugging Face's ZeroGPU grant program, and prior work on SoundStorm, Parakeet, and the Descript Audio Codec.
Nari Labs itself comprises just two engineers, one full-time and one part-time, but they are actively inviting community contributions through its Discord server and GitHub.
With its clear focus on expressive quality, voice cloning, and open access, Dia adds a distinctive new voice to the landscape of generative speech models.