Ultra Ethernet Testing

Optimizing Ethernet for AI & HPC networks

Ethernet for AI and HPC Networks

Why networks are advancing through Ultra Ethernet and related technologies

Ethernet for AI and HPC networks must support fundamentally different traffic patterns than traditional data center environments. Large‑scale AI training and inference rely on collective communication, synchronized traffic bursts, and ultra‑low latency behavior across both scale‑up and scale‑out fabrics. These requirements are driving a new generation of Ethernet technologies optimized for high bandwidth, fast loss recovery, and predictable performance.

Ultra Ethernet represents an architectural evolution of Ethernet designed specifically for AI and HPC workloads. It introduces mechanisms such as Link‑Layer Retry, advanced congestion handling, and transport behavior optimized for AI fabrics. At the same time, physical interfaces are advancing rapidly, with Ethernet switch and NIC ports scaling to 400G, 800G, and up to 1.6 Tbps, enabled by 224G SerDes technologies.

Validating Ethernet devices for AI and HPC requires more than throughput testing. Network equipment must be tested for loss behavior, congestion response, latency sensitivity, and interoperability under realistic AI traffic conditions. This includes validation of Ultra Ethernet features, scale‑up and scale‑out topologies, and next‑generation high‑speed Ethernet ports.

Our test platforms are designed to support comprehensive testing of Ethernet for AI and HPC networks, including Ultra Ethernet functionality, 1.6 Tbps port speeds, and 224G SerDes interfaces. They enable deterministic, scalable, and repeatable validation of network performance, reliability, and readiness for deployment in demanding AI and HPC infrastructures.

Scale‑Up Ethernet (SUE): High‑Bandwidth, Low‑Latency Communication

Scale‑up Ethernet refers generally to an architectural concept for Ethernet deployments optimized for high‑bandwidth, low‑latency communication within tightly coupled AI and HPC domains, such as intra‑node and intra‑rack environments. Some vendors have taken this a step further by defining specific protocol extensions or frame formats reflecting different implementation choices aimed at addressing the demanding performance requirements of scale‑up workloads.

In these AI and HPC domains, Ethernet interconnects GPUs, accelerators, and other compute devices that participate in collective operations and synchronous workloads. The primary requirements are extreme bandwidth, deterministic latency, and highly predictable performance under bursty traffic patterns.

As AI systems scale, bandwidth demands increase rapidly, driving Ethernet link speeds from 400G to 800G and onward to 1.6 Tbps, enabled by next‑generation 224G SerDes technologies. At the same time, latency sensitivity becomes more pronounced, making consistency and low jitter as critical as peak throughput. Even small variations in delivery timing can impact the efficiency of collective operations and overall training or inference performance.
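As a rough illustration of why jitter matters as much as peak throughput, the sketch below models one step of a synchronous collective as finishing only when its slowest transfer arrives. The function names and latency values are illustrative assumptions, not measurements from any device.

```python
def collective_step_time(latencies_us):
    """A synchronous step completes only when the slowest transfer arrives."""
    if not latencies_us:
        raise ValueError("need at least one transfer latency")
    return max(latencies_us)


def jitter_penalty(latencies_us):
    """Time spent waiting on the slowest transfer, relative to the mean."""
    mean = sum(latencies_us) / len(latencies_us)
    return collective_step_time(latencies_us) - mean


# Hypothetical per-GPU transfer latencies in microseconds.
steady = [10.0, 10.0, 10.0, 10.0]
jittery = [9.0, 9.0, 9.0, 13.0]  # same mean as above, but one straggler

print(collective_step_time(steady))   # 10.0
print(collective_step_time(jittery))  # 13.0
print(jitter_penalty(jittery))        # 3.0
```

Both sets of transfers have the same average latency, yet the step with a single straggler takes 30% longer, which is why latency distributions and not only averages are worth measuring.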

Testing scale‑up Ethernet fabrics therefore focuses on validating bandwidth scaling efficiency, latency determinism, and loss recovery behavior across short paths and minimal hop counts. This can include verification of Ultra Ethernet capabilities alongside high‑speed physical interfaces, ensuring Ethernet implementations can reliably support high‑performance GPU‑to‑GPU and accelerator communication in demanding AI and HPC systems.

Scale‑Out Ethernet: Building Large AI and HPC Fabrics

In large AI and HPC systems, Ethernet is used to interconnect compute resources across racks, forming fabrics that scale to thousands of nodes. These scale‑out environments support distributed training and inference workloads where communication spans many devices and switch hops, and where overall system performance depends on the coordinated behavior of the entire network rather than individual links.

As fabric size increases, the primary challenges shift from raw point‑to‑point bandwidth to resiliency, congestion management, and fairness at scale. Packet loss, congestion hotspots, or link failures can affect a large number of flows simultaneously and disrupt collective operations. Ensuring predictable behavior across many hops, while maintaining efficient utilization of shared network resources, is essential to sustaining throughput and minimizing iteration time in large‑scale AI workloads.

To meet these demands, scale‑out Ethernet designs commonly incorporate a variety of mechanisms aimed at managing congestion, absorbing faults, and maintaining fairness across competing traffic streams. Vendors have historically implemented different combinations of queueing strategies, congestion control techniques, telemetry, and transport optimizations to address these challenges in large, multi‑hop fabrics. These implementation choices reflect differing approaches to achieving reliable and efficient Ethernet operation at scale, while the term scale‑out Ethernet itself remains focused on deployment characteristics and workload requirements rather than any single standardized protocol.

Ultra Ethernet introduces architectural mechanisms intended to improve reliability, congestion handling, and recovery behavior in AI‑focused Ethernet fabrics. In scale‑out environments, such capabilities can complement existing Ethernet technologies by helping to limit the impact of loss or congestion propagation across many hops, contributing to more predictable fabric‑level behavior without prescribing a specific topology or vendor implementation.

Testing scale‑out Ethernet fabrics therefore extends beyond validating link speeds or isolated devices. Network equipment must be evaluated under realistic, multi‑hop topologies and distributed traffic conditions to assess congestion response, fairness, fault recovery, and interoperability at scale. This includes exercising Ultra Ethernet features alongside established Ethernet behaviors to ensure large AI and HPC fabrics operate efficiently, reliably, and predictably under demanding real‑world workloads.
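One common way to put a number on fairness in such a test is Jain's fairness index, computed over the throughput measured for each competing flow. The sketch below is a minimal example; the choice of this metric and the sample throughput values are assumptions for illustration, not part of any Ultra Ethernet requirement.

```python
def jain_fairness_index(throughputs):
    """Jain's index: (sum x)^2 / (n * sum x^2).

    Returns 1.0 when all flows get equal throughput and 1/n when a
    single flow takes everything.
    """
    n = len(throughputs)
    sum_sq = sum(x * x for x in throughputs)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one flow with non-zero throughput")
    total = sum(throughputs)
    return (total * total) / (n * sum_sq)


# Hypothetical per-flow throughput in Gbps across a shared bottleneck.
balanced = [100, 100, 100, 100]
starved = [400, 0, 0, 0]

print(jain_fairness_index(balanced))  # 1.0
print(jain_fairness_index(starved))   # 0.25
```

Tracking this index while load, hop count, or fault conditions change gives a single comparable figure per test run.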

Ultra Ethernet

Ultra Ethernet is being developed within the Ultra Ethernet Consortium (UEC), an industry collaboration focused on evolving Ethernet to better support the performance, scale, and reliability requirements of AI and HPC networks.

Ultra Ethernet introduces a set of architectural mechanisms that improve reliability, determinism, and performance in high‑speed Ethernet fabrics. Rather than defining an entirely new protocol, Ultra Ethernet utilizes and extends existing Ethernet mechanisms – including discovery, framing, and encoding behaviors – to enable faster loss recovery, more effective congestion handling, and better coordination between connected devices.

Link Layer Discovery Protocol (LLDP)

In Ultra Ethernet environments, LLDP is used as a foundational mechanism for capability discovery and fabric awareness. Ultra Ethernet introduces UE‑specific Type‑Length‑Value (TLV) extensions within LLDP that allow directly connected devices to advertise and negotiate support for Ultra Ethernet features.

These TLVs enable peers to exchange information about supported capabilities, operational modes, and feature eligibility, forming a consistent understanding of link behavior before data traffic flows. This capability discovery helps ensure that Ultra Ethernet enhancements are applied only where both ends of a link support them, contributing to predictable and interoperable operation across heterogeneous AI and HPC fabrics.
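The sketch below shows how such a TLV could be packed and how two peers might settle on a common feature set. The 7-bit type and 9-bit length header and the use of type 127 for organizationally specific TLVs follow standard LLDP; the OUI, subtype, and feature bit values are placeholders invented for this example and are not the values assigned by the UEC.

```python
import struct

ORG_SPECIFIC_TLV_TYPE = 127  # LLDP organizationally specific TLV type

# Placeholders only: NOT the real UE OUI, subtype, or capability bits.
PLACEHOLDER_OUI = bytes.fromhex("000000")
PLACEHOLDER_SUBTYPE = 1
FEATURE_LLR = 0x01
FEATURE_CBFC = 0x02


def encode_org_tlv(oui, subtype, info):
    """Pack an organizationally specific TLV: header, OUI, subtype, info."""
    if len(oui) != 3:
        raise ValueError("OUI must be exactly 3 bytes")
    if not 0 <= subtype <= 255:
        raise ValueError("subtype must fit in one byte")
    value = oui + bytes([subtype]) + info
    if len(value) > 511:
        raise ValueError("TLV value exceeds the 9-bit length field")
    header = (ORG_SPECIFIC_TLV_TYPE << 9) | len(value)
    return struct.pack("!H", header) + value


def decode_tlv(data):
    """Split one TLV into (type, value) using the 7-bit/9-bit header."""
    (header,) = struct.unpack("!H", data[:2])
    tlv_type, length = header >> 9, header & 0x1FF
    value = data[2:2 + length]
    if len(value) != length:
        raise ValueError("truncated TLV")
    return tlv_type, value


def negotiated_features(local_bits, peer_bits):
    """Enable a feature only when both ends of the link advertise it."""
    return local_bits & peer_bits


tlv = encode_org_tlv(PLACEHOLDER_OUI, PLACEHOLDER_SUBTYPE,
                     bytes([FEATURE_LLR | FEATURE_CBFC]))
tlv_type, value = decode_tlv(tlv)
peer_bits = value[4]  # first info byte after the OUI and subtype
print(tlv.hex())                                      # fe0500000000010 3 -> see note
print(negotiated_features(FEATURE_LLR, peer_bits))    # 1
```

Note: the first print emits the seven bytes fe 05 00 00 00 01 03 as one contiguous hex string. A test tool would typically also send malformed or unexpected TLVs to confirm that a device falls back to standard Ethernet behavior.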

Link‑Layer Retry (LLR)

Link‑Layer Retry (LLR) introduces localized Layer‑2 retry mechanisms that enable rapid recovery of lost Ethernet frames between directly connected devices. Ultra Ethernet leverages existing Ethernet framing and encoding constructs while introducing explicit identification of LLR‑eligible and LLR‑ineligible frames.

At a high level, this includes the use of preamble signaling and PCS block‑coding indications to differentiate traffic types, as well as dedicated control frames (CtlOS frames) used to coordinate retry behavior between peers. By performing retransmission locally at the link layer, LLR reduces reliance on higher‑layer end‑to‑end recovery and limits the propagation of transient loss, helping maintain low latency and more deterministic performance in both scale‑up and scale‑out environments.
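The local retransmission idea can be sketched as a toy go-back-N model with a bounded replay buffer. This is a simplification for illustration only: the class names, the ACK/NACK tuples, and the unbounded sequence numbers are assumptions, and the sketch does not reproduce the Ultra Ethernet state machine, timers, or on-the-wire encodings.

```python
from collections import OrderedDict


class LlrSender:
    """Toy sender that keeps unacknowledged frames in a replay buffer."""

    def __init__(self, replay_capacity):
        self.replay_capacity = replay_capacity
        self.next_seq = 0
        self.replay = OrderedDict()  # seq -> frame awaiting acknowledgement

    def send(self, frame):
        if len(self.replay) >= self.replay_capacity:
            raise RuntimeError("replay buffer full; wait for an ACK")
        seq = self.next_seq
        self.replay[seq] = frame
        self.next_seq += 1
        return seq, frame

    def on_ack(self, acked_seq):
        """Cumulative ACK: release every frame up to and including acked_seq."""
        for seq in [s for s in self.replay if s <= acked_seq]:
            del self.replay[seq]

    def on_nack(self, expected_seq):
        """Peer is missing expected_seq: replay it and everything after it."""
        self.on_ack(expected_seq - 1)
        return list(self.replay.items())


class LlrReceiver:
    """Toy receiver that only accepts frames in sequence order."""

    def __init__(self):
        self.expected_seq = 0
        self.delivered = []

    def receive(self, seq, frame):
        if seq == self.expected_seq:
            self.delivered.append(frame)
            self.expected_seq += 1
            return ("ACK", seq)
        if seq < self.expected_seq:
            return ("ACK", self.expected_seq - 1)  # duplicate, already delivered
        return ("NACK", self.expected_seq)         # gap detected


sender, receiver = LlrSender(replay_capacity=4), LlrReceiver()
sent = [sender.send(f) for f in ("A", "B", "C")]

sender.on_ack(receiver.receive(*sent[0])[1])  # frame A arrives
kind, wanted = receiver.receive(*sent[2])     # frame B was lost; C arrives early
for seq, frame in sender.on_nack(wanted):     # replay B, then C
    sender.on_ack(receiver.receive(seq, frame)[1])

print(receiver.delivered)  # ['A', 'B', 'C']
```

The loss is repaired between the two link partners, so the higher-layer transport never sees a missing frame. A test setup would inject such losses deliberately and check both the recovery time and the resulting frame order.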

Credit-Based Flow Control (CBFC)

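Credit-Based Flow Control (CBFC) enhances how Ethernet fabrics respond to congestion under the highly synchronized and incast‑heavy traffic patterns common in AI workloads. In a credit-based scheme, a sender transmits only while it holds credits that the receiver has granted for available buffer space, so backpressure is applied before buffers overflow rather than after. Ultra Ethernet builds on established Ethernet congestion signaling concepts while introducing more explicit and timely congestion and backpressure communication between devices.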

By enabling congestion conditions to be signaled and acted upon closer to where they occur, CBFC helps prevent congestion from spreading across the fabric and improves fairness among competing flows. This contributes to more predictable performance and efficient utilization of shared resources, particularly in large AI and HPC fabrics with many parallel traffic streams.

Why Testing Ultra Ethernet Details Matters

While Ultra Ethernet enhancements are subtle at the frame, preamble, and encoding levels, their correct implementation is critical to fabric behavior. Capability discovery via LLDP TLVs, correct identification of LLR‑eligible traffic, accurate handling of control frames, and proper PCS block‑coding behavior must all interoperate precisely across devices.

Testing Ultra Ethernet therefore goes beyond throughput or link‑up validation. Network equipment must be verified at the Ethernet frame and symbol level to ensure correct interpretation, signaling, and interaction of these mechanisms under realistic AI traffic conditions. Validating these details is essential to achieving the intended gains in reliability, determinism, and performance in production AI and HPC networks.

UE and AI Solutions from Xena