AI Infrastructure

Testing networks hosting AI applications

AI has entered the data center

The rapid growth of Artificial Intelligence (AI) applications is reshaping how data centers are designed and optimized. Modern AI workloads—especially Large Language Models (LLMs) and deep learning—demand enormous computational power, massive datasets, and extremely low‑latency data movement across the network. To meet these stringent requirements, data centers rely on AI accelerators, GPUs, and a range of high‑performance interconnect technologies. These include lossless Ethernet‑based transports such as RoCEv2 and Ultra Ethernet Transport (UET), as well as non‑Ethernet fabrics like InfiniBand, which are designed to deliver both low latency and high bandwidth. Combined with next‑generation high‑speed Ethernet links reaching up to 1.6 Tbps, this infrastructure enables AI systems to process data in real time and scale efficiently across distributed compute environments.

Operators and manufacturers need realistic, hardware-based testing to ensure that networking hardware performs reliably at scale. Building full-scale test networks, however, is costly and time-consuming, which drives the need for effective hardware-based test tools that can emulate realistic conditions without a full production deployment.

Spine and leaf topology

AI clusters often consist of thousands of interconnected accelerators that generate heavy east‑west, lossless traffic with strict low‑latency requirements. A spine‑and‑leaf network topology supports this by providing predictable, high‑bandwidth connectivity. Leaf switches connect directly to servers and storage devices, while each leaf is linked to every spine switch. This minimizes hop count, prevents blocking, enables efficient load balancing, and scales easily as more switches are added.
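
As a simple illustration, the sketch below models a leaf-spine fabric in Python and checks its two defining properties: every leaf uplinks to every spine (so any leaf-to-leaf path is exactly two hops), and server-facing bandwidth is balanced against uplink bandwidth. All port counts and link speeds here are illustrative, not a specific product configuration.

    # Minimal model of a leaf-spine fabric (illustrative sizes, not a
    # specific product). Checks full leaf-to-spine connectivity and
    # reports the oversubscription ratio per leaf.
    from itertools import product

    SPINES = [f"spine{i}" for i in range(4)]
    LEAVES = [f"leaf{i}" for i in range(8)]
    UPLINK_GBPS = 800        # leaf-to-spine link speed (assumed)
    DOWNLINK_GBPS = 400      # leaf-to-server link speed (assumed)
    SERVERS_PER_LEAF = 8

    # Full mesh between leaves and spines: one link per (leaf, spine) pair.
    links = set(product(LEAVES, SPINES))

    # Property 1: every leaf reaches every spine, so any leaf-to-leaf
    # path is exactly two hops (leaf -> spine -> leaf).
    assert all((leaf, spine) in links for leaf in LEAVES for spine in SPINES)

    # Property 2: oversubscription = server-facing bandwidth / uplink bandwidth.
    downlink = SERVERS_PER_LEAF * DOWNLINK_GBPS
    uplink = len(SPINES) * UPLINK_GBPS
    print(f"Oversubscription per leaf: {downlink / uplink:.2f}:1")  # 1.00:1

A ratio of 1.00:1 or lower indicates a non-blocking fabric, which is the usual design target for AI training clusters.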

InfiniBand

InfiniBand is a dedicated high‑performance networking technology widely used in HPC and AI clusters for its ultra‑low latency and extremely high bandwidth. Unlike Ethernet‑based transports, InfiniBand uses a specialized lossless fabric, hardware offloads, and adaptive routing mechanisms to provide consistent performance at massive scale.

Remote Direct Memory Access over Ethernet (RoCE)

RDMA over Converged Ethernet (RoCE) enables low‑latency, high‑throughput data transfers across Ethernet networks. By allowing direct memory access between AI accelerators, servers, and storage, RoCE reduces CPU involvement, lowers latency, and supports the real‑time performance demands of large‑scale AI training workloads.

Ultra Ethernet Transport (UET)

Ultra Ethernet Transport adds enhancements to standard Ethernet to improve congestion control, reliability, and transport‑layer performance for AI and HPC environments. Features such as Link Layer Retry (LLR), Credit‑Based Flow Control (CBFC), and advanced link‑negotiation mechanisms help ensure deterministic low latency, efficient large‑scale load balancing, and resilient performance at the link layer under the extreme traffic patterns typical of modern AI clusters.

Managing congestion is essential in AI infrastructure

Detecting and avoiding congestion is crucial for maintaining end-to-end low-latency and lossless performance in AI infrastructures. Congestion control is often managed through Priority Flow Control (PFC) and Explicit Congestion Notification (ECN).

ECN gives an early indication that a queue is starting to fill and avoids congestion by having the sender slow the traffic rate of the affected Class of Service (CoS). PFC takes congestion control one step further by temporarily pausing that CoS at the sender. Optimizing an AI network's performance typically involves tuning the ECN thresholds Kmin and Kmax, which define the queue depths at which marking begins and at which it reaches its maximum probability.
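
The marking behavior these thresholds control follows the WRED-style scheme used by most data-center switches: below Kmin no packets are marked, between Kmin and Kmax the marking probability rises linearly toward a configured maximum, and above Kmax every packet is marked. A minimal sketch of that curve (the threshold values are illustrative, not vendor defaults):

    # WRED-style ECN marking probability as a function of queue depth.
    # Threshold values below are illustrative, not vendor defaults.
    K_MIN = 100    # queue depth (KB) where marking starts
    K_MAX = 400    # queue depth (KB) where marking reaches P_MAX
    P_MAX = 0.2    # marking probability at K_MAX

    def ecn_mark_probability(queue_kb: float) -> float:
        """Probability that an arriving packet is marked ECN CE ('11')."""
        if queue_kb <= K_MIN:
            return 0.0
        if queue_kb >= K_MAX:
            return 1.0   # beyond Kmax, every packet is marked
        return P_MAX * (queue_kb - K_MIN) / (K_MAX - K_MIN)

    for depth in (50, 150, 250, 350, 450):
        print(f"queue={depth:3d} KB -> mark p={ecn_mark_probability(depth):.2f}")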

How to test congestion in an AI infrastructure

The Z800 Freya Ethernet Traffic Generator can be used to verify whether a switch under test handles congestion correctly, and to fine-tune the ECN thresholds for optimal performance. The traffic generators can vary the traffic rates for different priority flows entering the switch's queues and verify the assertion of PFC, as well as the ratio between packets marked ECN '11' (Congestion Experienced) and '10' (ECN-Capable Transport).
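
For the ECN part of that analysis, the codepoint is simply the two least-significant bits of the IP TOS/Traffic Class byte, so the ratio of CE ('11') to ECT(0) ('10') packets can be computed directly from captured headers. A minimal sketch, with the capture source abstracted away:

    # Compute the ratio of ECN CE ('11') marks to ECT(0) ('10') packets
    # from the TOS/Traffic Class byte of captured IP headers. The capture
    # source is abstracted away; tos_bytes stands in for real data.
    from collections import Counter

    ECT0, CE = 0b10, 0b11

    def ecn_stats(tos_bytes):
        """Count ECN codepoints; the codepoint is the 2 LSBs of the TOS byte."""
        return Counter(tos & 0b11 for tos in tos_bytes)

    # Example: a hypothetical capture where some packets were marked CE.
    tos_bytes = [0x02, 0x02, 0x03, 0x02, 0x03, 0x02]
    stats = ecn_stats(tos_bytes)
    ratio = stats[CE] / stats[ECT0] if stats[ECT0] else float("inf")
    print(f"CE={stats[CE]}, ECT(0)={stats[ECT0]}, CE/ECT(0)={ratio:.2f}")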

Automating advanced test cases and integrating them with your own scripts is easy using our open-source Xena OpenAutomation (XOA) Python API.
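
As a rough outline, connecting to a chassis and reserving a port with the open-source xoa-driver package looks like the sketch below. The attribute names follow the public XOA documentation, but exact details may differ between driver versions, so treat this as a starting point rather than a verbatim recipe.

    # Outline of an XOA automation script using the open-source
    # xoa-driver package (pip install xoa-driver). Names follow the
    # public docs but may vary between driver versions.
    import asyncio
    from xoa_driver import testers

    async def main() -> None:
        # Connect to the chassis (IP address and username are placeholders).
        async with testers.L23Tester("10.0.0.100", "demo-user") as tester:
            module = tester.modules.obtain(0)   # first test module
            port = module.ports.obtain(0)       # first port on that module
            await port.reservation.set_reserve()
            # ... configure streams, start traffic, read PFC/ECN counters ...
            await port.reservation.set_release()

    asyncio.run(main())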

Common tests for networks hosting AI applications

Here are common scenarios for using our Ethernet solutions to optimize the performance of networks running AI applications:

Testing scenarios for AI and HPC networking infrastructures

Validating the performance and reliability of AI and HPC networks requires testing across multiple layers of the stack. Modern cluster interconnects operate at extremely high speeds, rely on advanced PHYs and transceivers, and include complex link‑layer behavior that must be fully characterized before deployment. The following testing scenarios demonstrate how hardware-based test solutions can emulate realistic data‑center conditions and verify end‑to‑end performance.

Physical layer validation

The physical layer for high‑speed electrical and optical interfaces incorporates a wide range of advanced features across the PCS, PMA, and PMD sublayers, including Forward Error Correction (FEC), PCS lane distribution, equalization settings, and auto-negotiation and link training. These parameters directly affect how reliably 112G and 224G links operate and must be thoroughly tested and tuned to achieve optimal BER performance. Because testing frequently requires alternating between the physical layer and the data link layer, it is particularly valuable to use a traffic generator such as the Z1608 Edun, which combines advanced L1 control with comprehensive L2 capabilities.
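
For the BER measurement itself, a practical question is how many bits must be observed before a target BER can be claimed with statistical confidence. For zero observed errors, the standard rule is N = -ln(1 - CL) / BER, where CL is the desired confidence level. A short sketch of what this means in test time (the 800G line rate is chosen for illustration):

    # Bits that must be observed error-free to claim a target BER at a
    # given confidence level (zero-error rule: N = -ln(1-CL)/BER), and
    # how long that takes at a given line rate.
    import math

    def bits_required(target_ber: float, confidence: float = 0.95) -> float:
        """Bits to observe with zero errors to assert BER < target_ber."""
        return -math.log(1.0 - confidence) / target_ber

    LINE_RATE_BPS = 800e9          # e.g. an 800G link
    for ber in (1e-6, 1e-8, 1e-10):
        n = bits_required(ber)
        print(f"BER<{ber:.0e}: {n:.2e} bits = {n / LINE_RATE_BPS:.3f} s at 800G")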

Verifying and troubleshooting data and protocols

Test equipment such as the SierraNet M1288 protocol analyzer provides detailed visibility by capturing data in real time, enabling vendors to troubleshoot errors, including those that arise in interoperability scenarios. Typical analyses include examining ECN bits, PFC bits, UEC preamble formats, and control ordered sets at the PCS sublayer, as well as examining the sequence of frames transmitted on a link. The M1288 also allows users to inject errors into frames to evaluate how devices under test handle fault conditions.
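
As one example of such analyses, decoding a PFC frame is straightforward because IEEE 802.1Qbb defines a fixed layout: a MAC Control frame (EtherType 0x8808) carrying opcode 0x0101, a class-enable vector, and eight 16-bit pause quanta, one per priority. A minimal parser sketch for the MAC Control payload:

    # Parse the payload of an IEEE 802.1Qbb PFC frame: opcode 0x0101, a
    # class-enable vector, then eight 16-bit pause quanta (one per priority).
    import struct

    def parse_pfc(payload: bytes) -> dict:
        """Return {priority: pause_quanta} for priorities enabled in the frame."""
        opcode, enable_vector = struct.unpack_from("!HH", payload, 0)
        if opcode != 0x0101:
            raise ValueError(f"not a PFC frame (opcode {opcode:#06x})")
        quanta = struct.unpack_from("!8H", payload, 4)
        return {p: quanta[p] for p in range(8) if enable_vector & (1 << p)}

    # Example payload: PFC pausing priority 3 for 0xFFFF quanta.
    payload = struct.pack("!HH8H", 0x0101, 1 << 3,
                          *(0xFFFF if p == 3 else 0 for p in range(8)))
    print(parse_pfc(payload))   # {3: 65535}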

Endpoint emulation and traffic load testing

Testing must also support UEC compliance procedures to verify interoperability across switches and NICs. Test equipment such as the Z1600 Edun traffic analyzer can operate as a UEC‑capable endpoint. This enables evaluation of features such as link negotiation using LLDP, Link Layer Retry (LLR), and Credit‑Based Flow Control (CBFC) in realistic traffic scenarios. By emulating an actual UET device, testers can validate timing, protocol behavior, and system stability under load without requiring access to large‑scale production networks.
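
The credit mechanism at the heart of CBFC can be sketched generically: the sender holds a credit balance, spends credits as it transmits, and stalls (rather than drops) when the balance reaches zero, resuming when the receiver returns credits as it frees buffer space. A toy model of that accounting (parameter values are illustrative, not UEC-specified):

    # Toy model of credit-based flow control: the sender spends one
    # credit per frame and stalls at zero; the receiver returns credits
    # as it drains its buffer. Parameter values are illustrative only.
    class CreditChannel:
        def __init__(self, initial_credits: int):
            self.credits = initial_credits   # receiver buffer slots available

        def try_send(self) -> bool:
            """Sender side: transmit one frame if a credit is available."""
            if self.credits == 0:
                return False                 # stall instead of dropping
            self.credits -= 1
            return True

        def return_credits(self, n: int) -> None:
            """Receiver side: signal that n buffer slots have been freed."""
            self.credits += n

    ch = CreditChannel(initial_credits=4)
    sent = sum(ch.try_send() for _ in range(6))
    print(f"sent {sent} of 6 frames before stalling")       # sent 4 of 6
    ch.return_credits(2)
    print(f"after credit return, can send: {ch.try_send()}")  # True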

Our solutions for AI infrastructure

Broad range of test modules

Xena offers a selection of test modules for testing all Ethernet speeds from 10 Mbps to 1.6 Tbps.

Our range includes the latest Ethernet technology, with 224G PAM4 SerDes, in our Z1608 Edun product line. For testing 10 Mbps to 800 Gbps using 10G/25G NRZ and 56G/112G PAM4 SerDes, we offer our Z800 Freya product line.

Easy-to-use software

Our test solutions include feature-rich software for generating Ethernet traffic flows and performing in-depth analysis of test results. The software is unified across all test modules and speeds, ensuring a consistent user experience regardless of hardware configuration. The primary tool is XenaManager.

There are also test suites for running standard tests such as RFC2544, RFC2889, RFC3918 and Y.1564, as well as specialized AN/LT tests. These are complemented by a comprehensive range of scripting and test automation options, including Xena OpenAutomation (XOA), an open-source test automation framework featuring a Python API that runs on any OS.

Robust chassis choices

Choose between the robust, scalable 4U XenaBay with space for up to 12 test modules, or the small, easy-to-transport 1U XenaCompact with a single test module.

Exceptional value

All our solutions include the Xena Value Pack which consists of 3 years’ SW updates, 3 years’ HW warranty, free online/email support for the lifetime of the product and free product training.

Together with our low per-port pricing, this represents significant savings on the total cost of ownership (TCO) of your Ethernet traffic generation and analysis solutions.
