FlashLabs Researchers Release Chroma 1.0: A 4B Real Time Speech Dialogue Model With Personalized Voice Cloning
Chroma 1.0 is a real-time speech-to-speech dialogue model that takes audio as input and returns audio as output while preserving speaker identity across multi-turn conversations. It is presented as the first open-source spoken dialogue system that combines low-latency interaction with high-fidelity personalized voice cloning from just a few seconds of reference audio.
The model works directly on discrete speech representations rather than text. It targets the same use cases as real-time commercial agents, but with a built-in 4B parameter dialogue core and a design that treats speaker similarity as a primary goal rather than an ancillary feature. Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline and reaches a Real Time Factor (RTF) of 0.43, meaning it generates speech more than twice as fast as real-time playback.

From cascaded ASR ➡️ LLM ➡️ TTS to end-to-end S2S
Most production assistants still use a three-stage pipeline: automatic speech recognition to convert audio to text, a large language model for reasoning, and text-to-speech synthesis. This architecture is flexible, but it introduces latency and loses paralinguistic information such as timbre, emotion, speaking rate, and prosody once the system converts audio to text. In real-time dialogue, that loss of acoustic detail directly degrades speaker fidelity and naturalness.
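To make that information loss concrete, here is a toy Python sketch of a cascaded pipeline. All three stages are stand-ins rather than real ASR, LLM, or TTS calls; the point is only that anything not captured as text never reaches the synthesis stage.

```python
# Toy illustration of why a cascaded ASR -> LLM -> TTS pipeline drops
# paralinguistic detail: once audio is collapsed to text, timbre, emotion,
# and prosody are gone. All stages below are placeholders, not real systems.

def asr(audio):            # audio -> text only; prosody and timbre are discarded
    return audio["text"]

def llm(text):             # reasons over text alone
    return f"Reply to: {text}"

def tts(text):             # synthesizes in a generic voice, not the user's
    return {"text": text, "voice": "default", "emotion": None}

user_turn = {"text": "I'm fine", "voice": "alice", "emotion": "sarcastic"}
print(tts(llm(asr(user_turn))))   # Alice's timbre and sarcasm never reach TTS
```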
Chroma follows the newer class of speech-to-speech systems that operate directly on sequences of codec tokens. A neural audio codec encodes speech into discrete, quantized acoustic codes. The language model then reasons and responds over a sequence that interleaves text tokens and audio tokens, with no explicit text-only intermediate stage. This keeps the model conditioned on timbre and speaker identity throughout the entire processing chain.
Architecture: Reasoner plus speech generation stack
Chroma 1.0 has two main subsystems. Chroma Reasoner handles multimodal understanding and text generation. The speech stack (Chroma Backbone, Chroma Decoder, and Chroma Codec Decoder) converts this semantic output into a personalized vocal response.
Chroma Reasoner is built on the Thinker module of the Qwen-Omni series and uses the Qwen2-Audio encoding pipeline. It processes text and audio inputs with shared front-ends, fuses them with cross-modal attention, and aligns them in time using time-aligned multimodal rotary position embedding (TM-RoPE). The output is a sequence of hidden states that carry both linguistic content and acoustic cues, for example rhythm and emphasis.


Chroma Backbone is a 1B parameter LLaMA-style model based on Llama 3. It is conditioned on the target voice using CSM-1B, which encodes a short reference audio clip and its transcript into embedding prompts prepended to the sequence. During inference, text tokens and hidden states from the Reasoner are fed in as a unified context, so the backbone always sees the semantic state of the dialogue while generating audio tokens.
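As a rough illustration of prompt-style voice conditioning, the sketch below prepends an embedded reference clip and its transcript to the generation context ahead of the Reasoner's hidden states. The function names are hypothetical and do not mirror the CSM-1B interface.

```python
# Minimal sketch of voice conditioning via a prepended prompt, assuming the
# reference clip and its transcript are embedded first. Illustrative only.

def embed_reference(reference_audio, reference_text):
    # Stand-in for encoding a few seconds of reference audio plus its transcript.
    return [("ref_audio_emb", reference_audio), ("ref_text_emb", reference_text)]

def build_context(reference_audio, reference_text, reasoner_states):
    prompt = embed_reference(reference_audio, reference_text)
    # The backbone sees the voice prompt first, then the dialogue's semantic state.
    return prompt + [("reasoner_hidden", h) for h in reasoner_states]

ctx = build_context("ref.wav", "Hi, this is my voice sample.", ["h0", "h1", "h2"])
print(ctx[:3])
```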
To support streaming, the system uses a fixed 1-to-2 interleaving schedule: for each text token from the Reasoner, the backbone produces two audio tokens. This lets the model start producing speech as soon as text generation begins, instead of waiting for complete sentences. This interleaving is the main mechanism behind the reduced time to first token.
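Here is a minimal sketch of that fixed 1-to-2 schedule, assuming hypothetical text and audio token streams; the tags and function are illustrative, not Chroma's actual API.

```python
# Minimal sketch of a fixed 1-to-2 text/audio interleaving schedule.

def interleave_1_to_2(text_tokens, audio_tokens):
    """Merge tokens so each text token is followed by two audio tokens.

    Speech emission can begin as soon as the first text token (and its two
    audio tokens) are available, which is what lowers time to first token.
    """
    merged = []
    audio_iter = iter(audio_tokens)
    for t in text_tokens:
        merged.append(("text", t))
        for _ in range(2):                       # fixed 1:2 ratio
            merged.append(("audio", next(audio_iter, None)))
    return merged

# Example: 3 text tokens paired with 6 audio tokens.
print(interleave_1_to_2(["Hel", "lo", "!"], [101, 102, 103, 104, 105, 106]))
# [('text', 'Hel'), ('audio', 101), ('audio', 102), ('text', 'lo'), ...]
```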
Chroma Decoder is a lightweight LLaMA variant with about 100 million parameters. For each frame, the backbone predicts only the first RVQ codebook, which gives a coarse approximation of that frame. The decoder then takes the backbone hidden state and that first token and autoregressively predicts the remaining RVQ levels within the same frame. This split keeps the long-context temporal structure in the backbone and restricts the decoder to frame-local refinement, which reduces computation while improving acoustic detail and expressiveness.
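The following sketch shows the coarse-to-fine split in spirit: a long-context "backbone" emits one coarse token (codebook 0) per frame, and a small "decoder" fills in the remaining levels using only frame-local information. All predictors here are dummies; the 8-codebook count matches the article, everything else is an assumption.

```python
# Illustrative coarse-to-fine RVQ decoding, assuming 8 codebooks per frame.

import random

NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2048

def backbone_predict_coarse(context):
    # Stand-in for the long-context backbone: one coarse token per frame.
    return random.randrange(CODEBOOK_SIZE)

def decoder_refine(hidden_state, coarse_token):
    # Stand-in for the lightweight decoder: autoregressively predicts the
    # remaining 7 RVQ levels within the same frame only.
    fine = []
    prev = coarse_token
    for level in range(1, NUM_CODEBOOKS):
        nxt = (prev * 31 + level) % CODEBOOK_SIZE   # dummy frame-local prediction
        fine.append(nxt)
        prev = nxt
    return fine

frames = []
for step in range(4):                               # generate 4 frames
    coarse = backbone_predict_coarse(context=frames)
    frames.append([coarse] + decoder_refine(hidden_state=None, coarse_token=coarse))

print(frames)   # each frame is a list of 8 codebook indices
```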
Chroma Codec Decoder concatenates the coarse and refined codes and maps them to waveform samples. It follows the decoder design of the Mimi codec and uses causal convolutions, so each output sample depends only on past context, which is required for streaming. The system uses 8 codebooks, which keeps the number of autoregressive decoding steps low while retaining enough detail for faithful voice reproduction.
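For readers unfamiliar with causal convolutions, the generic PyTorch sketch below (not the actual Mimi or Chroma decoder) shows how left-only padding guarantees that each output sample depends only on past inputs, which is what makes streaming synthesis possible.

```python
# A minimal causal 1-D convolution of the kind a streaming codec decoder needs.

import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1                    # left padding only
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                             # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))       # no look-ahead into the future
        return self.conv(x)

layer = CausalConv1d(channels=16, kernel_size=7)
x = torch.randn(1, 16, 100)
y = layer(x)
print(y.shape)   # torch.Size([1, 16, 100]); output at time t uses only inputs <= t
```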
Training setup and synthetic speech-to-speech (S2S) data
High-quality speech dialogue data with strong reasoning signals is scarce. Chroma therefore uses a synthetic speech-to-speech (S2S) pipeline. An LLM reasoner first produces textual answers to user questions, and a TTS system then synthesizes target speech for those answers that matches the reference timbre. These synthetic pairs train the backbone and decoder to perform acoustic modeling and voice reproduction, while the Reasoner remains frozen and acts as a provider of text and multimodal hidden states.
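A hedged sketch of that data recipe: an LLM writes the textual answer, a TTS system renders it in the reference speaker's timbre, and the resulting pair supervises the speech stack. `llm_answer` and `tts_clone` are placeholders, not real Chroma components.

```python
# Sketch of synthetic S2S pair construction under the assumptions stated above.

def llm_answer(question_text: str) -> str:
    return f"Synthetic answer to: {question_text}"            # stand-in reasoner

def tts_clone(text: str, reference_clip: str) -> str:
    return f"<audio of '{text}' in the voice of {reference_clip}>"  # stand-in TTS

def build_pair(question_text: str, reference_clip: str) -> dict:
    answer_text = llm_answer(question_text)
    answer_audio = tts_clone(answer_text, reference_clip)
    return {"question": question_text,
            "answer_text": answer_text,       # supervision for the backbone
            "answer_audio": answer_audio,     # supervision for decoder and codec
            "reference": reference_clip}

print(build_pair("What is RVQ?", "speaker_0421.wav"))
```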
Voice cloning quality and comparison with existing systems
The objective evaluation uses the SEED-TTS-EVAL protocol on CommonVoice English speakers. Chroma operates at a 24 kHz sampling rate and achieves a speaker similarity score of 0.81, while the human baseline is 0.73. CosyVoice 3 reaches 0.72, and most other TTS baselines fall below the human reference. The research team reports this as a 10.96% relative improvement over the human baseline, suggesting that the model captures fine-grained speaker characteristics more consistently than human recordings on this metric.
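The relative-improvement figure can be checked directly from the reported similarity scores; speaker similarity itself is typically a cosine similarity between speaker embeddings of generated and reference audio, though the exact embedding model used here is not stated.

```python
# Sanity check of the reported relative improvement over the human baseline.

chroma_sim = 0.81
human_sim = 0.73

relative_gain = (chroma_sim - human_sim) / human_sim
print(f"{relative_gain:.2%}")   # ~10.96%, matching the reported figure
```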


Subjective evaluation compares Chroma with the ElevenLabs model eleven_multilingual_v2. In the naturalness CMOS condition, listeners preferred ElevenLabs 57.2% of the time versus 24.4% for Chroma, with 18.3% ties. In the speaker-similarity CMOS condition, the results were very close: 42.4% for ElevenLabs and 40.6% for Chroma, with 17.0% ties. A follow-up test asking which audio sounds more natural, ElevenLabs output or the original recordings, resulted in a 92.0% preference for ElevenLabs versus 8.0% for the ground truth, showing that perceived naturalness and speaker fidelity are distinct qualities.
Latency and real-time behavior
Latency is measured with a single synchronous stream. For a response of 38.80 seconds, the total generation time is 16.58 seconds, giving a real time factor (RTF) of 0.43. The Reasoner contributes 119.12 ms to time to first token (TTFT), the Backbone 8.48 ms, and the Decoder 19.27 ms per frame on average. The Codec Decoder works on groups of 4 frames, so TTFT does not apply to that component. The total time to first token is 146.87 ms, which is well under 1 second and suitable for interactive dialogue.
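Both headline latency numbers follow directly from the reported measurements, as this small calculation shows.

```python
# Reproducing the latency arithmetic from the reported figures.

audio_duration_s = 38.80        # length of the generated response
generation_time_s = 16.58       # wall-clock time to generate it
rtf = generation_time_s / audio_duration_s
print(f"RTF = {rtf:.2f}")       # 0.43, i.e. generation runs faster than playback

ttft_ms = 119.12 + 8.48 + 19.27   # Reasoner + Backbone + Decoder contributions
print(f"TTFT = {ttft_ms:.2f} ms") # 146.87 ms, well under one second
```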


Spoken dialogue and reasoning benchmarks
Chroma is evaluated on the basic track of URO-Bench. With only 4B parameters, it achieves an overall task completion score of 57.44%. GLM-4-Voice, a 9B parameter model, leads with 69.09%. Chroma ranks second overall and outperforms many of the 7B and 0.5B baselines on several dimensions. It reaches 71.14% on Storal, 51.69% on TruthfulQA, and 22.74% on GSM8K. For spoken conversation measures, its highest scores are 60.26% on MLC and 62.07% on CommonVoice.


More importantly, Chroma is the only model in this comparison that supports personalized voice cloning. All other systems focus on spoken dialogue and reasoning alone, so Chroma delivers competitive reasoning ability together with real-time, high-fidelity voice personalization.
Key takeaways
- End-to-end real-time speech to speech: Chroma 1.0 is a 4B parameter spoken dialogue model that maps speech to speech directly over codec tokens, avoids explicit ASR and TTS stages, and preserves paralinguistic information and speaker identity through the entire path.
- Reasoner plus speech generation stack: The system combines a Qwen-based Chroma Reasoner, a 1B LLaMA-style Chroma Backbone, a roughly 100M parameter Chroma Decoder, and a Mimi-based Chroma Codec Decoder, and uses RVQ codebooks and an interleaved 1-to-2 text-to-audio schedule to support streaming and a low time to first token.
- Strong personalized voice cloning: On SEED-TTS-EVAL with CommonVoice speakers, Chroma reaches a speaker similarity score of 0.81 at 24 kHz, reported as a 10.96 percent relative improvement over the human baseline of 0.73, and outperforms CosyVoice 3 and other TTS baselines.
- Sub-second latency and faster-than-real-time generation: Single-stream inference on an H200 GPU yields a total time to first token of about 147 ms, and for a 38.80 second response, the model generates the audio in 16.58 seconds, a real-time factor of 0.43, more than twice as fast as playback.
- Competitive dialogue and reasoning with voice cloning as a unique feature: On the URO-Bench basic track, Chroma achieves a 57.44 percent overall task completion score and competitive scores on Storal, TruthfulQA, GSM8K, MLC, and CommonVoice.
Check out the paper, model weights, and project page for more details.

