StepFun Introduces Step-Audio-AQAA: A Fully End-to-End Audio Language Model for Natural Voice Interaction

Rethinking Audio-Based Human-Computer Interaction
Machines that can respond to human speech with equally natural, expressive audio have become a major goal of intelligent interaction systems. Audio-language modeling extends this vision by combining speech recognition, natural language understanding, and audio generation. Rather than relying on text-based intermediaries, models in this space aim to understand and respond through voice alone. This matters not only for accessibility and inclusiveness, but also for achieving more fluid, human-like machine interactions in applications such as voice assistants, audio-based storytelling, and hands-free computing.
Limitations of Cascaded Speech Pipelines
Despite advances in audio understanding, a clear challenge remains: most systems still rely on a chain of separate modules for speech-to-text, text processing, and text-to-speech. This modular approach can degrade performance and responsiveness because errors and latency accumulate across stages. These pipelines also lack expressive control, making them unsuitable for nuanced tasks such as emotional dialogue or dynamic speech synthesis. An ideal solution would be a single unified model that understands an audio question and generates an expressive audio answer directly, eliminating all text-based intermediation.
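To make the latency and error-accumulation point concrete, here is a minimal Python sketch of such a cascaded pipeline. The asr, llm, and tts functions are hypothetical stubs rather than any real API, and the sleep calls merely stand in for per-stage model latency.

```python
# Minimal sketch of a cascaded voice pipeline. All three stages are
# hypothetical stubs; the sleeps stand in for per-stage model latency.
import time

def asr(audio: bytes) -> str:
    time.sleep(0.3)                        # stage 1: speech -> text; transcription errors start here
    return "what is the weather today"

def llm(text: str) -> str:
    time.sleep(0.5)                        # stage 2: text -> text; reasons over possibly lossy input
    return "It looks sunny this afternoon."

def tts(text: str) -> bytes:
    time.sleep(0.3)                        # stage 3: text -> speech; prosody chosen without hearing the user
    return b"<waveform>"

start = time.time()
reply = tts(llm(asr(b"<user waveform>")))  # three hops: errors compound, latency adds up
print(f"cascaded round trip: {time.time() - start:.2f}s")
```

Each hop discards information (the user's tone never reaches the TTS stage), which is exactly the expressivity gap described above.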
From Token-Based Models Toward Fully Unified LALMs
Several approaches have tried to address this. Early methods such as HuggingGPT and AudioGPT used cascaded architectures that chained separate speech and language models. While they broadened task coverage, these systems struggled with real-time voice interaction. Later work, including VALL-E, SpeechGPT, AudioPaLM, and Qwen2-Audio, introduced token-based systems that convert audio into discrete representations. Even these models, however, often output text and depend on separate vocoders, which limits their ability to produce expressive, low-latency audio responses.
Introducing Step-Audio-AQAA: An End-to-End AQAA System
Researchers at StepFun have introduced Step-Audio-AQAA, a large audio-language model designed specifically for Audio Query-Audio Answer (AQAA) tasks. Unlike earlier models, Step-Audio-AQAA transforms spoken input directly into expressive spoken output without converting it to intermediate text. The architecture combines a dual-codebook audio tokenizer, a 130-billion-parameter backbone LLM named Step-Omni, and a flow-matching vocoder for natural speech synthesis. Integrating these components enables seamless, low-latency interaction.
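In contrast to the cascaded sketch above, this is a single token pipeline from audio in to audio out. The following is a hypothetical sketch of that data flow; all three callables are invented placeholders, not StepFun's released API, and the real Step-Omni backbone is a 130B-parameter network rather than these toy functions.

```python
# Hypothetical sketch of the Step-Audio-AQAA data flow described above.
# Every function here is a placeholder stub, not the released API.
def dual_codebook_tokenizer(waveform: bytes) -> list[int]:
    return [17, 402, 9]                      # speech -> discrete audio tokens (made-up values)

def step_omni(tokens: list[int]) -> list[int]:
    return [88, 5, 1203]                     # 130B LLM consumes and emits tokens directly

def flow_matching_vocoder(tokens: list[int]) -> bytes:
    return b"<expressive waveform>"          # tokens -> speech; no text stage anywhere

reply = flow_matching_vocoder(step_omni(dual_codebook_tokenizer(b"<query waveform>")))
```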

Tokenization, Architecture, and Voice Control
The method begins with two separate audio tokenizers, one for linguistic features and one for semantic features. The linguistic tokenizer, based on Paraformer, extracts structured speech elements such as phonemes at 16.7 Hz using a 1,024-entry codebook. The semantic tokenizer, inspired by CosyVoice 1.0, encodes acoustic richness at 25 Hz with a 4,096-entry codebook. The two streams are interleaved at a 2:3 ratio, matching their 16.7:25 frame-rate ratio, and passed into Step-Omni, a multimodal decoder-only LLM trained on text, audio, and image data. The model then emits tri-codebook sequences of audio and text tokens, which the vocoder converts into fluent speech. This setup enables fine-grained voice control, including emotional tone and speaking rate.
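The 2:3 interleaving can be pictured as repeating groups of two linguistic tokens followed by three semantic tokens. The sketch below assumes that simplest grouping (the paper's exact ordering may differ), with made-up token values standing in for entries from the real 1,024- and 4,096-entry codebooks.

```python
# Sketch of 2:3 interleaving of the two token streams, assuming the simplest
# reading: repeating groups of 2 linguistic tokens, then 3 semantic tokens.
from itertools import islice

def interleave_2_3(linguistic, semantic):
    """Yield tokens in repeating [2 linguistic, 3 semantic] groups."""
    ling, sem = iter(linguistic), iter(semantic)
    while True:
        group = list(islice(ling, 2)) + list(islice(sem, 3))
        if not group:                          # both streams exhausted
            return
        yield from group

ling_tokens = [f"L{i}" for i in range(4)]      # ~16.7 tokens/s stream (1,024-entry codebook)
sem_tokens = [f"S{i}" for i in range(6)]       # ~25 tokens/s stream (4,096-entry codebook)
print(list(interleave_2_3(ling_tokens, sem_tokens)))
# -> ['L0', 'L1', 'S0', 'S1', 'S2', 'L2', 'L3', 'S3', 'S4', 'S5']
```

Note that the 2:3 grouping keeps the two streams time-aligned: in any fixed window, the 16.7 Hz stream produces two tokens for every three from the 25 Hz stream.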
Benchmarking and Results
The model was evaluated on the StepEval-Audio-360 benchmark, which comprises multilingual, multi-dialect audio tasks across nine categories, including creativity, gaming, emotion control, role-playing, and voice understanding. Compared with state-of-the-art models such as Kimi-Audio and Qwen-Omni, Step-Audio-AQAA achieved the highest Mean Opinion Scores in most categories. In the text-audio token ratio experiments, the 10:15 configuration performed best, scoring 4.03 on Chat, 0.65 on Relevance, and 0.67 on Factuality. Among the audio interleaving techniques tested, marker-preserving concatenation performed best, with Chat at 4.22, Relevance at 0.57, and Factuality at 0.57. These numbers reflect the model's strength at generating semantically accurate, emotionally rich, and context-aware audio responses.

Conclusion: Toward Expressive Machine Speech
Step-Audio-AQAA offers a robust answer to the limitations of modular speech pipelines. By combining expressive audio tokenization, a powerful multimodal LLM, and post-training strategies such as Direct Preference Optimization and model merging, it succeeds in generating high-quality, emotionally resonant audio responses. This work marks an important step toward machines that communicate through speech that is not only functional but expressive and fluid.
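For readers unfamiliar with Direct Preference Optimization, the standard DPO objective is sketched below in PyTorch. This is the generic textbook formulation, not StepFun's exact post-training recipe, and the log-probabilities in the usage line are made up for illustration.

```python
# Generic DPO objective: -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))),
# where pi/ref are policy/reference log-probs of the chosen (w) and rejected (l)
# responses. Shown for context only; not StepFun's exact recipe.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a single preference pair:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(f"DPO loss: {loss.item():.4f}")
```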
Check out the Paper and the Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.
