OpenAI’s new voice AI model gpt-4o-transcribe lets you add speech to your existing text apps in seconds

OpenAI’s voice AI models have gotten the company into trouble before with actress Scarlett Johansson, but that isn’t stopping the company from continuing to advance its offerings in this category.
Today, the ChatGPT maker unveiled three new proprietary voice models: gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts. The models will initially be available through OpenAI’s application programming interface (API) for third-party software developers to build their own apps. They will also be available on a dedicated demo site, OpenAI.fm, that individual users can access for limited testing and fun.
Moreover, the gpt-4o-mini-tts model’s voices can be customized from several presets via text prompt to change their accents, pitch, tone, and other vocal qualities, including conveying whatever emotions the user asks of them, which should go a long way toward addressing any lingering concerns. Now it’s up to the user to decide how they want their AI voice to sound when it speaks back.
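For developers, that steerability is exposed as a plain-text instruction passed alongside the text to be spoken. Here is a minimal sketch using the official openai Python SDK, assuming the instructions parameter works as the launch materials describe; the voice name, prompt, and file name are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Generate speech whose delivery is steered by a plain-text instruction.
# The `instructions` field carries the accent/pitch/tone/emotion direction
# described above; voice, input text, and file name are illustrative.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the built-in preset voices
    input="Your order has shipped and should arrive tomorrow.",
    instructions="Speak like a calm, reassuring yoga teacher.",
) as response:
    response.stream_to_file("update.mp3")  # write the generated audio to disk
```

Changing only the instructions string (say, to an excitable game-show host) re-voices the same text, which is the steerability the demo site is built to show off.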
In a demo with VentureBeat conducted over a video call, Jeff Harris, a member of OpenAI’s product staff, showed how, using text alone on the demo site, a user can get the same voice to sound like a cackling mad scientist or a calm yoga teacher.
Discovering and refining new capabilities within the GPT-4o base
The models are variants of the existing GPT-4o model OpenAI launched back in May 2024, which currently powers the ChatGPT text and voice experience for many users. The company took that base model and post-trained it with additional data to make it excel at transcription and speech. The company didn’t specify when the models might come to ChatGPT.
“ChatGPT has slightly different requirements in terms of cost and performance trade-offs, so while I expect they will move to these models in time, for now, this launch is focused on API users,” Harris said.
The new models are meant to supersede OpenAI’s two-year-old open-source Whisper speech-to-text model, offering lower word error rates across industry benchmarks and improved performance in noisy environments, with diverse accents, and at varying speech speeds, across 100 languages.
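For teams already transcribing audio through OpenAI’s API, adopting the new models should mostly amount to swapping the model string, assuming the gpt-4o-transcribe family is served through the same transcriptions endpoint that Whisper uses in the official openai Python SDK. A minimal sketch, with an illustrative file name:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local recording; previously this same call would have
# used model="whisper-1".
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```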
The company posted a chart on its website showing how much lower the gpt-4o-transcribe models’ word error rates are across 33 languages compared to Whisper, including a rate of just 2.46% in English.
“These models include noise cancellation and a semantic voice activity detector, which helps determine when a speaker has finished a thought, improving transcription accuracy,” Harris said.
Harris told VentureBeat that the new gpt-4o-transcribe model family is not designed to offer “diarization,” the capability to label and distinguish between different speakers. Instead, it is designed primarily to receive one (or possibly multiple voices) as a single input channel and respond to all the input with a single output voice in that interaction, however long it runs.
The company is also hosting a competition for the general public to find the most creative examples of using its demo site, OpenAI.fm, and share them online by tagging the @openai account on X. The winner will receive a custom Teenage Engineering radio bearing the OpenAI logo, which OpenAI head of product Olivier Godement said is one of only three in the world.
A gold mine of voice applications
These improvements make the models especially suitable for applications such as customer call centers, meeting note transcription, and AI-powered assistants.
Impressively, the company’s newly launched Agents SDK also allows developers who have already built apps on top of its text-based large language models, like the regular GPT-4o, to add fluid voice interactions with only “nine lines of code,” according to a presenter during the OpenAI YouTube livestream announcing the new models.
For example, an e-commerce app built on top of GPT-4o could now respond to spoken user questions like “tell me about my recent orders” with speech, just by adding these new models.
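Based on the Agents SDK’s voice pipeline documentation, the wiring would look roughly like the sketch below. The class names follow that documentation but should be treated as assumptions, the order-assistant agent is illustrative, and the microphone capture and playback plumbing (play_audio is a hypothetical helper) is omitted:

```python
import numpy as np

from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# An existing text agent, e.g. the e-commerce assistant above (illustrative).
agent = Agent(
    name="order_assistant",
    instructions="Answer questions about the user's recent orders.",
)

# Wrap the text agent in a voice pipeline:
# speech-to-text -> agent -> text-to-speech.
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

async def handle_turn(mic_buffer: np.ndarray) -> None:
    """Run one spoken exchange: audio in, synthesized speech out."""
    result = await pipeline.run(AudioInput(buffer=mic_buffer))
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            play_audio(event.data)  # hypothetical playback helper
```

The appeal of this design is that the underlying text agent is unchanged; speech-to-text and text-to-speech are bolted on around it.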
“For the first time, we’re introducing streaming speech-to-text, allowing developers to continuously input audio and receive a text stream in real time, making conversations feel more natural,” Harris said.
However, for those seeking low-latency, real-time AI voice experiences, OpenAI recommends using its speech-to-speech models in the Realtime API.
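In the Python SDK, the streaming transcription mode Harris describes would look roughly like this sketch. The stream=True flag and the event names are assumptions based on the launch description, and the file name is illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Stream partial transcript text as the audio is processed (sketch;
# event names are assumptions based on the behavior described above).
with open("call.wav", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
        stream=True,
    )
    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)  # incremental text
        elif event.type == "transcript.text.done":
            print()  # final transcript complete
```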
Pricing and availability
The new models are available immediately via OpenAI’s API, with pricing as follows:
• gpt-4o-transcribe: $6.00 per 1M audio input tokens (~$0.006 per minute)
• gpt-4o-mini-transcribe: $3.00 per 1M audio input tokens (~$0.003 per minute)
• gpt-4o-mini-tts: $0.60 per 1M text input tokens, $12.00 per 1M audio output tokens (~$0.015 per minute)
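Those per-minute approximations imply roughly 1,000 audio tokens per minute of audio (back-solving $6.00 per 1M tokens ≈ $0.006 per minute). A small cost estimator under that stated assumption:

```python
# Rough cost estimator for the prices listed above. The ~1,000 audio
# tokens per minute figure is an assumption back-solved from OpenAI's
# own per-minute approximations ($6.00 per 1M tokens ~= $0.006/minute).
AUDIO_TOKENS_PER_MINUTE = 1_000

PRICE_PER_MILLION_AUDIO_TOKENS = {
    "gpt-4o-transcribe": 6.00,       # audio input tokens
    "gpt-4o-mini-transcribe": 3.00,  # audio input tokens
    "gpt-4o-mini-tts": 12.00,        # audio *output* tokens
}

def estimated_audio_cost(model: str, minutes: float) -> float:
    """Approximate audio-token cost in dollars for `minutes` of audio."""
    tokens = minutes * AUDIO_TOKENS_PER_MINUTE
    return tokens / 1_000_000 * PRICE_PER_MILLION_AUDIO_TOKENS[model]

print(f"${estimated_audio_cost('gpt-4o-transcribe', 60):.2f}")  # ~$0.36/hour
```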
However, the models arrive at a time of fiercer competition than ever in the AI transcription and speech space, with dedicated speech AI firms such as ElevenLabs offering its new Scribe model, which supports diarization and boasts a similarly low (but not as low) error rate of 3.3% in English, priced at $0.40 per hour of input audio (or $0.006 per minute, roughly equivalent).
Another startup, Hume AI, offers a new model, Octave TTS, with sentence-level and even word-level customization of pronunciation and emotional inflection, based entirely on the user’s instructions rather than any preset voices. Octave TTS’s pricing isn’t directly comparable, but there is a free tier offering 10 minutes of audio, with paid usage beyond that.
Meanwhile, more advanced audio and speech models are also coming to the open-source community, including one called Orpheus 3B, available under the permissive Apache 2.0 license, meaning developers don’t have to pay anything to run it, provided they have the right hardware or cloud servers.
Industry adoption and early results
According to testimonials OpenAI shared with VentureBeat, several companies have already integrated the new audio models into their platforms and report significant improvements in voice AI performance.
EliseAI, a company focused on property management automation, found that OpenAI’s text-to-speech model enabled more natural and emotionally rich interactions with tenants.
The enhanced voices made AI-powered leasing, maintenance, and scheduling interactions more engaging, leading to higher tenant satisfaction and improved call resolution rates.
Decagon, which builds AI-powered voice experiences, saw a 30% improvement in transcription accuracy using OpenAI’s speech recognition model.
That increase in accuracy has allowed Decagon’s AI agents to perform more reliably in real-world scenarios, even in noisy environments. The integration was fast, with Decagon incorporating the new model into its systems within a single day.
Not all reactions to OpenAI’s latest release have been warm, however. Ben Hylak (@benhylak), co-founder of AI app analytics software Dawn AI and a former Apple human interfaces designer, posted that while the models seem promising, the announcement “feels like a retreat from real-time voice,” suggesting a shift away from OpenAI’s previous focus on low-latency conversational AI via ChatGPT.
In addition, the launch was preceded by an early leak on X (formerly Twitter). TestingCatalog News (@testingcatalog) posted details of the new models several minutes before the official announcement, listing the names gpt-4o-mini-tts, gpt-4o-transcribe, and gpt-4o-mini-transcribe. Credit for the leak went to @StivenTheDev, and the post quickly gained traction.
Looking ahead, however, OpenAI plans to keep refining its audio models and to explore custom voice capabilities while ensuring safe, responsible AI use. Beyond audio, OpenAI is also investing in multimodal AI, including video, to enable more dynamic and interactive agent-based experiences.
2025-03-20 18:21:00