Meta unleashes Llama API running 18x faster than OpenAI: Cerebras partnership delivers 2,600 tokens per second

Meta today announced a partnership with Cerebras Systems to power its new Llama API, offering developers inference speeds up to 18 times faster than traditional GPU-based solutions.
The announcement, made at Meta’s LlamaCon developer conference in Menlo Park, positions the company to compete directly with OpenAI and Google in the rapidly growing AI inference services market, where developers buy tokens by the billions to power their applications.
“Meta has selected Cerebras to collaborate to deliver the ultra-fast inference that they need to serve developers through their new Llama API,” said Julie Shin Choi, chief marketing officer at Cerebras, during a press briefing. “We at Cerebras are really excited to announce our first CSP hyperscaler partnership to deliver fast inference to all developers.”
The partnership marks Meta’s formal entry into the business of selling AI computation, transforming its popular open-source Llama models into a commercial service. While Meta’s Llama models have racked up more than a billion downloads, until now the company had not offered first-party cloud infrastructure for developers to build applications with them.
“This is very exciting, even without talking about Cerebras specifically,” said James Wang, a senior executive at Cerebras. “OpenAI, Anthropic, Google: they have built an entire new AI business from scratch, which is the AI inference business. Developers building AI apps will buy tokens by the millions, sometimes by the billions. These are like the new compute instructions that people need in order to build AI applications.”
Breaking the speed barrier: How Cerebras supercharges Llama models
What sets Meta’s offering apart is the dramatic speed increase delivered by Cerebras’ specialized AI chips. The Cerebras system delivers more than 2,600 tokens per second for Llama 4 Scout, compared to roughly 130 tokens per second for ChatGPT and about 25 tokens per second for DeepSeek, according to benchmarks from Artificial Analysis.
“If you just compare on an API-to-API basis, Gemini and GPT are all great models, but they all run at GPU speeds, which is roughly 100 tokens per second,” Wang explained. “And 100 tokens per second is fine for chat, but it’s far too slow for reasoning. It’s far too slow for agents. People are struggling with that today.”
This speed advantage enables entirely new categories of applications that were previously impractical: real-time agents, conversational low-latency voice systems, interactive code generation, and instant multi-step reasoning, all of which require chaining multiple large language model calls that can now be completed in seconds rather than minutes.
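To see why throughput matters so much for chained calls, consider a rough back-of-the-envelope calculation. The throughput figures below come from the Artificial Analysis benchmarks cited above; the number of chained calls and the output length per step are hypothetical round numbers chosen for illustration, since sequential agent steps cannot be parallelized and per-call latency compounds.

```python
# Back-of-the-envelope latency for a sequential, multi-step agent workflow.
# Throughput figures are from the article's cited benchmarks; the call count
# and tokens-per-step values are hypothetical illustrations.

THROUGHPUTS_TPS = {
    "Cerebras (Llama 4 Scout)": 2600,   # tokens per second
    "Typical GPU service (ChatGPT)": 130,
    "DeepSeek": 25,
}

CHAINED_CALLS = 10      # hypothetical agent: plan, search, draft, revise, ...
TOKENS_PER_CALL = 500   # hypothetical output length per step

total_tokens = CHAINED_CALLS * TOKENS_PER_CALL
for name, tps in THROUGHPUTS_TPS.items():
    seconds = total_tokens / tps
    print(f"{name}: {seconds:.1f}s for {CHAINED_CALLS} sequential calls")

# Output:
# Cerebras (Llama 4 Scout): 1.9s for 10 sequential calls
# Typical GPU service (ChatGPT): 38.5s for 10 sequential calls
# DeepSeek: 200.0s for 10 sequential calls
```

Under these assumptions, the same ten-step workflow drops from well over half a minute to about two seconds, which is the difference between an agent a user will wait for and one they will not.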
The Llama API represents a significant shift in Meta’s AI strategy, transitioning the company from a model provider to a full-service AI infrastructure company. By offering an API service, Meta creates a revenue stream from its AI investments while maintaining its commitment to open models.
“Meta is now in the business of selling tokens, and it’s great for the American AI ecosystem,” Wang noted during the press briefing. “They bring a lot to the table.”
The API will offer tools for fine-tuning and evaluation, starting with the Llama 3.3 8B model, allowing developers to generate data, train on it, and test the quality of their custom models. Meta emphasizes that it will not use customer data to train its own models, and models built using the Llama API can be transferred to other hosts, a clear differentiation from some competitors’ more closed approaches.
Cerebras will power Meta’s new service through its network of data centers located throughout North America, including facilities in Dallas, Oklahoma, Minnesota, Montreal, and California.
“All of our data centers that serve inference are in North America at this time,” Choi explained. “We will be serving Meta with the full capacity of Cerebras. The workload will be balanced across all of these different data centers.”
The business arrangement follows what Choi described as the classic “compute provider to a hyperscaler” model, similar to how Nvidia supplies hardware to major cloud providers. “They are reserving blocks of our compute that they can serve their developer population,” she said.
Beyond Cerebras, Meta has also announced a partnership with Groq to provide fast inference options, giving developers multiple high-performance alternatives beyond traditional GPU-based inference.
Meta’s entry into the inference API market with superior performance metrics could disrupt the established order dominated by OpenAI, Google, and Anthropic. By combining the popularity of its open-source models with dramatically faster inference capabilities, Meta is positioning itself as a formidable competitor in the commercial AI space.
“Meta is in a unique position with 3 billion users, hyperscale data centers, and a huge developer ecosystem,” according to Cerebras’ presentation materials, which added that the integration of Cerebras technology helps “Meta leapfrog OpenAI and Google in performance by approximately 20x.”
For Cerebras, the deal represents a major milestone and a validation of its specialized AI hardware approach. “We have been building this wafer-scale engine for years, and we always knew the technology was first rate, but ultimately it had to end up as part of someone else’s hyperscale cloud. That was the final target from a commercial strategy perspective, and we have finally reached that milestone,” Wang said.
The Llama API is currently available as a limited preview, with Meta planning a broader rollout in the coming weeks and months. Developers interested in ultra-fast Llama 4 inference can request early access by selecting Cerebras from the model options within the Llama API.
“If you imagine a developer who doesn’t know anything about Cerebras, because we’re a relatively small company, they can just click two buttons on Meta’s standard SDK, generate an API key, select the Cerebras flag, and then all of a sudden their tokens are being processed on a giant wafer-scale engine,” Wang explained. “That kind of having us be on the backend of Meta’s whole developer ecosystem is just tremendous for us.”
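For developers wondering what that two-click flow might look like in code, here is a minimal hypothetical sketch. It assumes an OpenAI-compatible chat completions endpoint; the base URL, model identifier, and provider-selection field are illustrative assumptions, not confirmed details of Meta’s SDK.

```python
# Hypothetical sketch of the flow Wang describes: create an API key, point a
# standard OpenAI-compatible client at the Llama API, and opt into Cerebras.
# The base URL, model name, and "provider" field are illustrative assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.llama.example/v1",  # placeholder endpoint
    api_key="YOUR_LLAMA_API_KEY",
)

response = client.chat.completions.create(
    model="llama-4-scout",                        # assumed model identifier
    messages=[{"role": "user", "content": "Summarize today's LlamaCon news."}],
    extra_body={"provider": "cerebras"},          # assumed Cerebras opt-in flag
)

print(response.choices[0].message.content)
```

The point of the design, as Wang describes it, is that nothing else changes for the developer: the same client code runs, only the tokens are served from Cerebras hardware.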
Meta’s choice of specialized silicon signals something deeper: in the next phase of AI, it’s not just what your models know, but how quickly they can think. In that future, speed isn’t just a feature; it’s the whole point.