AI model using AMD GPUs for training hits milestone

Zyphra, AMD, and IBM spent a year testing whether AMD's GPUs and platform could support training AI models at scale, and ZAYA1 is the result.

In partnership, the three companies trained ZAYA1, described as the first large-scale Mixture-of-Experts (MoE) foundation model built entirely on AMD GPUs and networking, which they present as proof that the market doesn't have to rely on NVIDIA to scale AI.

The model was trained on AMD Instinct MI300X GPUs, AMD Pensando networking, and the ROCm software stack, all running on IBM Cloud infrastructure. What is notable is how conventional the setup looks. Instead of exotic hardware or arcane configurations, Zyphra built the system like any enterprise cluster, just without the NVIDIA components.

Zyphra says ZAYA1's performance matches, and in some areas exceeds, established open models on reasoning, mathematics, and coding benchmarks. For companies frustrated by supply constraints or rising GPU prices, that amounts to something rare: a second option that doesn't require compromising on capability.

How Zyphra used AMD GPUs to cut costs without impacting AI training performance

Most organizations follow the same logic when planning training budgets: memory capacity, interconnect speed, and expected iteration times matter more than headline theoretical throughput.

The MI300X's 192GB of high-bandwidth memory per GPU gives engineers breathing room, allowing early training runs to proceed without immediately resorting to heavy parallelism. That tends to simplify setups that are otherwise fragile and time-consuming to tune.
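The memory argument can be made concrete with back-of-envelope arithmetic. The sketch below assumes a common training layout (bf16 weights and gradients plus fp32 AdamW master weights, momentum, and variance) and an illustrative activation reserve; the specific byte counts and reserve are assumptions, not figures from the article.

```python
# Back-of-envelope check: does a model's training state fit on one
# 192 GB MI300X without sharding? The per-parameter byte layout below
# (bf16 weights/grads + fp32 AdamW state) is a common convention,
# assumed here for illustration.
def training_bytes_per_param(weight_bytes=2, master_bytes=4,
                             momentum_bytes=4, variance_bytes=4,
                             grad_bytes=2):
    return (weight_bytes + master_bytes + momentum_bytes
            + variance_bytes + grad_bytes)

def fits_on_gpu(n_params, hbm_gb=192, activation_reserve_gb=40):
    """True if weights + optimizer state leave room for activations."""
    need_gb = n_params * training_bytes_per_param() / 1e9
    return need_gb <= hbm_gb - activation_reserve_gb
```

Under these assumptions an 8.3-billion-parameter model needs roughly 8.3e9 × 16 bytes ≈ 133 GB of weight and optimizer state, which still fits in 192 GB with activation headroom, whereas it would force sharding on an 80 GB card.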

Zyphra built each node with eight MI300X GPUs connected via Infinity Fabric and paired each GPU with its own Pensando Pollara NIC. A separate network handles dataset reads and checkpoint traffic. It's a simple design, but that seems to be the point: the simpler the wiring and network layout, the fewer switch hops and the easier it is to keep collective-operation times consistent.

ZAYA1: An AI model that punches above its weight

The ZAYA1 base model activates 760 million of its 8.3 billion total parameters and was trained on 12 trillion tokens in three phases. The architecture relies on compressed attention, an improved router that directs tokens to the right experts, and lightweight residual scaling to keep the deeper layers stable.
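The routing idea at the heart of an MoE layer is simple to sketch. The version below assumes a plain softmax router with top-2 selection; the article describes ZAYA1's improved router only at a high level, so this is a generic illustration, not Zyphra's design.

```python
import numpy as np

# Minimal sketch of top-k expert routing in an MoE layer, assuming a
# softmax router and top-2 selection (generic, not ZAYA1's router).
def route_tokens(hidden, router_weights, top_k=2):
    """hidden: (tokens, dim); router_weights: (dim, n_experts).
    Returns chosen expert ids and normalized gate weights."""
    logits = hidden @ router_weights                       # (tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)             # softmax
    top = np.argsort(probs, axis=-1)[:, -top_k:]           # top-k expert ids
    gates = np.take_along_axis(probs, top, axis=-1)
    gates /= gates.sum(axis=-1, keepdims=True)             # renormalize gates
    return top, gates
```

Each token is then processed only by its selected experts, which is why activated parameters (760M) can be a small fraction of total parameters (8.3B).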

The model uses a combination of the Muon and AdamW optimizers. To make Muon efficient on AMD hardware, Zyphra fused kernels and cut unnecessary memory traffic so the optimizer does not dominate each iteration. Batch sizes grew over the course of training, but that only works if the storage pipeline can deliver tokens fast enough.

All of this yields an AI model trained on AMD hardware that competes with larger peers such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. One advantage of the MoE structure is that only a small fraction of the parameters is active for any given token, which helps manage inference memory and reduces serving cost.

For example, a bank could train a domain-specific model for investigative work without committing to complex parallelism up front. The MI300X's memory headroom gives engineers room to iterate, while ZAYA1's compressed attention cuts prefill time during evaluation.

Making ROCm behave on AMD GPUs

Zyphra has made no secret of the fact that porting its mature NVIDIA-based workflow to ROCm took work. Instead of blindly translating components, the team spent time measuring how AMD's hardware behaves and re-tuned model dimensions, GEMM shapes, and microbatch sizes to fit the MI300X's preferred compute patterns.

Infinity Fabric works best when all eight GPUs on a node participate in collectives, and Pollara tends to reach peak throughput with larger messages, so Zyphra sized its collective message buffers accordingly. Long-context training, extended from 4k to 32k tokens, relies on ring attention for sharded sequences and tree attention during decoding to avoid bottlenecks.
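Sizing messages for the NIC usually means packing many small gradient tensors into large fixed-size buckets before each collective, so every transfer is a big message. The bucketing scheme and the 64 MB threshold below are illustrative assumptions, not details from the report.

```python
import numpy as np

# Hypothetical sketch: pack small gradient tensors into large buckets
# so each all-reduce sends one big message near the NIC's sweet spot.
# The 64 MB default is an assumed, illustrative threshold.
def make_buckets(grads, bucket_bytes=64 * 1024 * 1024):
    buckets, current, current_bytes = [], [], 0
    for g in grads:
        current.append(g)
        current_bytes += g.nbytes
        if current_bytes >= bucket_bytes:
            buckets.append(np.concatenate([x.ravel() for x in current]))
            current, current_bytes = [], 0
    if current:                             # flush the final partial bucket
        buckets.append(np.concatenate([x.ravel() for x in current]))
    return buckets
```

Each flat bucket would then be handed to a single RCCL all-reduce call and scattered back into the original tensors afterward.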

Storage considerations were equally practical. Smaller models stress IOPS; larger ones need sustained bandwidth. Zyphra aggregates dataset shards to reduce scattered small reads and enlarges the per-node page cache to speed checkpoint recovery, which is vital during long runs where rewinds are inevitable.
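Shard aggregation itself is mundane: concatenate many small files into one large file and keep an offset index, so training reads become long sequential scans. This is a hypothetical sketch of the idea, not Zyphra's actual data-loading code.

```python
import os

def aggregate_shards(shard_paths, out_path):
    """Concatenate small dataset shard files into one large file so
    reads become sequential scans instead of scattered small I/O.
    Hypothetical sketch; returns an offset index into the big file."""
    offsets = {}
    with open(out_path, "wb") as out:
        for p in shard_paths:
            offsets[os.path.basename(p)] = out.tell()  # where shard starts
            with open(p, "rb") as f:
                out.write(f.read())
    return offsets
```

At read time a loader seeks to `offsets[name]` and streams forward, which is the access pattern that sustained-bandwidth storage tiers reward.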

Keeping clusters on their feet

Training jobs that run for weeks rarely go perfectly. Zyphra's Aegis service monitors logs and system metrics, recognizes failures such as NIC flaps or ECC errors, and takes corrective action automatically. The team also raised RCCL timeouts so that short network outages would not kill entire jobs.
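The monitor-classify-act loop can be sketched in a few lines. The failure patterns and the action table below are purely illustrative assumptions; Aegis's actual rules are not described in the article.

```python
import re

# Hypothetical sketch of an Aegis-style watchdog: classify a log line
# against known transient-failure patterns, then pick an action.
# Patterns and actions are illustrative, not Zyphra's actual rules.
TRANSIENT = {
    "nic_flap": re.compile(r"link (down|flap)", re.I),
    "ecc_error": re.compile(r"ECC (corrected|uncorrected) error", re.I),
    "rccl_timeout": re.compile(r"RCCL.*timed out", re.I),
}

def classify(line):
    """Return the failure class of a log line, or None if unrecognized."""
    for name, pat in TRANSIENT.items():
        if pat.search(line):
            return name
    return None

def decide(failure):
    """Map a failure class to a corrective action; escalate unknowns."""
    return {"nic_flap": "restart_nic",
            "ecc_error": "cordon_gpu",
            "rccl_timeout": "retry_collective"}.get(failure, "page_operator")
```

The important design choice is that known transient failures trigger a narrow automatic fix, while anything unrecognized escalates to a human instead of being retried blindly.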

Checkpoints are sharded across all GPUs rather than funneled through a single writer. Zyphra reports saves more than ten times faster than naive methods, which translates directly into more useful runtime and less operator effort.
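Sharded checkpointing means every rank writes only its own slice of the state in parallel, removing the single-writer bottleneck. The file layout below is a hypothetical sketch of the pattern, not Zyphra's checkpoint format.

```python
import os
import numpy as np

def save_sharded(state, rank, ckpt_dir):
    """Each rank writes only its own shard in parallel with the others,
    so no single writer bottlenecks the save (hypothetical sketch)."""
    path = os.path.join(ckpt_dir, "shard_%04d.npz" % rank)
    np.savez(path, **state)
    return path

def load_sharded(rank, ckpt_dir):
    """Each rank reads back only its own shard on restart."""
    path = os.path.join(ckpt_dir, "shard_%04d.npz" % rank)
    with np.load(path) as data:
        return {k: data[k] for k in data.files}
```

With N ranks writing concurrently, aggregate checkpoint bandwidth scales with the number of writers, which is where order-of-magnitude speedups over a single-writer save come from.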

What ZAYA1's AMD training run means for AI buyers

The report draws a clean line between the NVIDIA ecosystem and its AMD equivalents: NVLink vs Infinity Fabric, NCCL vs RCCL, cuBLASLt vs hipBLASLt, and so on. The authors argue that the AMD stack is now mature enough to support serious model development at scale.

None of this suggests that organizations should tear up existing NVIDIA clusters. A more realistic path is to keep NVIDIA for production while piloting AMD in phases that exploit the MI300X's memory capacity and ROCm's openness. That spreads supplier risk and increases overall training capacity without major disruption.

All of this leads to a set of recommendations: treat model shape as tunable rather than fixed; design the network around the collective operations your training will actually use; build fault tolerance that protects GPU-hours instead of just logging failures; and overlap checkpointing so it does not break the training rhythm.

That is not a manifesto, just the practical takeaway from what Zyphra, AMD, and IBM learned by training a large MoE model on AMD GPUs. For organizations looking to expand AI capacity without depending on a single vendor, it has the potential to be a useful blueprint.

See also: Google is committed to providing 1,000 times more AI infrastructure over the next four to five years

Want to learn more about AI and Big Data from industry leaders? Check out the AI & Big Data Expo taking place in Amsterdam, California, and London. This comprehensive event is part of TechEx and is co-located with other leading technology events including the Cyber Security Expo. Click here for more information.

AI News is powered by TechForge Media. Explore other upcoming enterprise technology events and webinars here.

2025-11-24 18:07:00
