AI Infrastructure

Testing networks hosting AI applications

AI has entered the data center

The proliferation of Artificial Intelligence (AI) applications is driving the need to optimize data centers for their specific requirements. AI algorithms, especially in deep learning, require significant computational resources and specialized AI infrastructure. These applications also depend on vast datasets and real-time processing capabilities. To meet these needs, data centers use AI accelerators, GPUs, and RDMA technologies, along with high-speed Ethernet connectivity.

Spine and leaf structure

Often comprising thousands of interconnected AI accelerators, AI networks must support heavy east-west (server-to-server) lossless traffic with low latency, as well as best-effort traffic.

A spine and leaf structure facilitates this. Leaf switches connect directly to servers, storage devices, and other endpoints, and each leaf switch is connected to every spine switch, ensuring high bandwidth and low latency. Spine switches interconnect all the leaf switches and carry the traffic between them, but don’t connect to endpoints directly. This:

  • minimizes the number of hops between servers – packets are routed through at most two leaf switches and one spine switch, which gives predictable latency.
  • connects every leaf switch directly to all spine switches, so the network is non-blocking.
  • inherently supports load balancing with traffic distributed evenly to prevent bottlenecks and ensure high throughput.
  • scales easily by adding more spine and/or leaf switches.
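
The properties listed above can be illustrated with a small model. The following is a minimal sketch, not a vendor tool: the function names, switch names, and port counts are hypothetical, and it assumes a two-tier fabric in which every leaf has one link to every spine. It shows the path between two leaves and a simple oversubscription ratio, where a ratio of 1.0 or less is one common way to check that uplink capacity matches the server-facing capacity.

```python
from itertools import product


def build_leaf_spine(num_leaves: int, num_spines: int) -> dict[str, set[str]]:
    """Return an adjacency map in which every leaf links to every spine."""
    adjacency: dict[str, set[str]] = {}
    for leaf, spine in product(range(num_leaves), range(num_spines)):
        adjacency.setdefault(f"leaf{leaf}", set()).add(f"spine{spine}")
        adjacency.setdefault(f"spine{spine}", set()).add(f"leaf{leaf}")
    return adjacency


def switch_path(adjacency: dict[str, set[str]], src_leaf: str, dst_leaf: str) -> list[str]:
    """List the switches traversed between servers attached to two leaves."""
    if src_leaf == dst_leaf:
        return [src_leaf]
    # Both leaves share every spine; pick the first one for a deterministic result.
    shared_spines = sorted(adjacency[src_leaf] & adjacency[dst_leaf])
    return [src_leaf, shared_spines[0], dst_leaf]


def oversubscription(server_ports: int, server_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Ratio of server-facing capacity to uplink capacity on one leaf."""
    return (server_ports * server_gbps) / (uplinks * uplink_gbps)


fabric = build_leaf_spine(num_leaves=4, num_spines=2)
print(switch_path(fabric, "leaf0", "leaf3"))
print(oversubscription(server_ports=32, server_gbps=400, uplinks=16, uplink_gbps=800))
```

In a real fabric the spine is chosen per flow by the load-balancing scheme rather than fixed, which is why packet ordering becomes a test topic.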

Building a large “spine & leaf” network of switches for testing purposes is as impractical as it is expensive. However, using traffic generators, impairment emulators & protocol analyzers is an easy and cost-effective alternative.

RDMA over Converged Ethernet (RoCE)

RDMA over Converged Ethernet (RoCE) is a protocol facilitating low-latency, high-throughput data transfers over Ethernet networks. RoCE enables direct memory access between AI accelerators and storage servers, minimizing CPU involvement and reducing latency.

Managing congestion is essential in AI infrastructure

Detecting and avoiding congestion is crucial for maintaining end-to-end low-latency and lossless performance in AI infrastructures. Congestion control is managed through Priority Flow Control (PFC) and Explicit Congestion Notification (ECN).

ECN gives an early indication that a queue is starting to fill up: the switch marks packets instead of dropping them, and the sender responds by slowing the traffic rate of the relevant Class of Service (CoS). PFC takes congestion control one step further by temporarily pausing traffic of that CoS from the sender. Optimizing an AI network’s performance typically involves tuning the ECN thresholds Kmin and Kmax.
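
To make the role of Kmin and Kmax concrete, here is a minimal sketch of a WRED-style marking curve that many switches use for ECN. It is an assumption, not a description of any particular switch: the function name, the `pmax` parameter, and the linear ramp are illustrative, and actual behavior and units (bytes, cells, or packets) vary by vendor. Below Kmin nothing is marked, between Kmin and Kmax the marking probability rises linearly to `pmax`, and at Kmax or above every packet is marked.

```python
def ecn_mark_probability(queue_depth: float, kmin: float, kmax: float,
                         pmax: float = 1.0) -> float:
    """Probability that a packet is ECN-marked at the given queue depth.

    Assumed model: no marking at or below kmin, a linear ramp up to pmax
    between kmin and kmax, and marking of every packet at or above kmax.
    """
    if kmin >= kmax:
        raise ValueError("kmin must be smaller than kmax")
    if not 0.0 <= pmax <= 1.0:
        raise ValueError("pmax must be between 0 and 1")
    if queue_depth <= kmin:
        return 0.0
    if queue_depth >= kmax:
        return 1.0
    return pmax * (queue_depth - kmin) / (kmax - kmin)


# Hypothetical thresholds, expressed in kilobytes of queue occupancy.
for depth in (100, 500, 900):
    print(depth, ecn_mark_probability(depth, kmin=200, kmax=800, pmax=0.5))
```

Under this model, lowering Kmin makes senders back off earlier (lower latency, possibly lower throughput), while raising Kmax leaves more room before PFC has to step in.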

How to test congestion in an AI infrastructure

The Z800 Freya Ethernet Traffic Generator can be used to verify whether a switch under test handles congestion correctly, and to fine-tune the ECN thresholds for optimal performance. The traffic generator can vary the traffic rates for different priority flows entering the switch’s queues and verify the assertion of PFC, as well as the ratio between packets marked ECN=‘11’ (Congestion Experienced) and ‘10’ (ECN-capable, not marked).
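
The ECN ratio mentioned above can be computed from captured IP headers. The sketch below is illustrative only and does not use the Xena tooling: the function names and sample values are hypothetical. It relies on the fact that the ECN field is the two least significant bits of the IPv4 ToS (or IPv6 Traffic Class) byte, and it reports the share of CE-marked packets among those carrying either ‘11’ or ‘10’, which is one possible reading of "ratio".

```python
from collections import Counter
from typing import Iterable

ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def ecn_codepoint(tos_byte: int) -> int:
    """Extract the 2-bit ECN field from a ToS / Traffic Class byte."""
    return tos_byte & 0b11


def ce_share(tos_bytes: Iterable[int]) -> float:
    """Fraction of packets marked CE ('11') among those marked '11' or '10'."""
    counts = Counter(ecn_codepoint(tos) for tos in tos_bytes)
    marked = counts[0b11]
    unmarked = counts[0b10]
    total = marked + unmarked
    return marked / total if total else 0.0


# Hypothetical capture: DSCP 26 with three ECT(0) packets and one CE packet.
sample = [0x6A, 0x6A, 0x6A, 0x6B]
print(ECN_NAMES[ecn_codepoint(sample[-1])], ce_share(sample))
```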

Automating advanced test cases and integrating them with your own scripts is easy using our open-source Xena OpenAutomation (XOA) Python API.

Common tests for networks hosting AI applications

Here are common scenarios for using our Ethernet solutions to optimize the performance of networks running AI applications:

Loop-back Testing

When testing NICs, for example, the transmitter is connected to the receiver via a network, as shown opposite. By inserting an Impairment Emulator between the Tx and Rx, it is possible to alter the flows by, for instance, mis-ordering packets and varying the latency.

At the same time, a Protocol Analyzer can be used to capture the packet headers and analyze various fields, such as the BTH+ 24-bit Packet Sequence Number.
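
As an illustration of the field analysis described above, the following sketch decodes the 12-byte Base Transport Header and checks the 24-bit Packet Sequence Number for gaps. It is a simplified example under stated assumptions: the function names and the sample bytes are hypothetical, only a few BTH fields are decoded, and the discontinuity check assumes all PSNs belong to a single queue pair.

```python
import struct

BTH_LENGTH = 12        # bytes in the Base Transport Header
PSN_MODULUS = 1 << 24  # the PSN is a 24-bit counter that wraps around


def parse_bth(bth: bytes) -> dict:
    """Decode selected fields from the first 12 bytes of a BTH."""
    if len(bth) < BTH_LENGTH:
        raise ValueError("BTH needs at least 12 bytes")
    opcode, _flags, pkey, qp_word, psn_word = struct.unpack("!BBHII", bth[:BTH_LENGTH])
    return {
        "opcode": opcode,
        "pkey": pkey,
        "dest_qp": qp_word & 0xFFFFFF,  # low 24 bits of the second word
        "ack_req": bool(psn_word >> 31),  # top bit of the third word
        "psn": psn_word & 0xFFFFFF,       # low 24 bits of the third word
    }


def count_psn_discontinuities(psns: list[int]) -> int:
    """Count packets whose PSN is not the previous PSN + 1 (mod 2**24)."""
    breaks = 0
    expected = None
    for psn in psns:
        if expected is not None and psn != expected:
            breaks += 1
        expected = (psn + 1) % PSN_MODULUS
    return breaks


# Hypothetical header bytes, for illustration only.
header = bytes([0x0A, 0x40, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x12,
                0x80, 0x00, 0xAB, 0xCD])
print(parse_bth(header))
print(count_psn_discontinuities([5, 6, 8, 7, 9]))
```

A single swapped pair shows up as several discontinuities, so the count is a coarse indicator of mis-ordering rather than an exact number of reordered packets.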

Traffic Load Testing

Use the Z800 Freya Traffic Generator, for example, to send combinations of RoCEv2 and other types of traffic to the NIC to simulate real-world situations. Link utilization can be varied up to 100% to test how the NIC performs under full load.

Use the E100 Impairment Emulator to, for example, filter RoCEv2 packets and impair them by mis-ordering the packets, adding latency, or emulating a short link breakage, and then assess throughput and protocol behavior. It can also be used to test a SmartNIC’s ability to reorder packets and handle latency variations.

Use M1288 Protocol Analyzer for insight into all details of the packet headers (crucial for debugging protocols and finding the root cause of any malfunction).
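
When planning the load tests described above, it helps to translate a link utilization percentage into a frame rate. The sketch below is a generic calculation rather than part of any Xena software; the function name is hypothetical. It assumes the frame size includes the FCS and adds the standard 20 bytes of per-frame overhead on the wire (preamble, start-of-frame delimiter, and minimum inter-frame gap).

```python
ETHERNET_OVERHEAD_BYTES = 20  # 7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap


def frames_per_second(line_rate_bps: float, frame_bytes: int,
                      utilization: float = 1.0) -> float:
    """Frame rate needed to reach the given link utilization (0.0 to 1.0)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if frame_bytes <= 0:
        raise ValueError("frame_bytes must be positive")
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return line_rate_bps * utilization / bits_on_wire


# 64-byte frames at full load and 4096-byte frames at half load on an 800 Gbps link.
print(round(frames_per_second(800e9, 64)))
print(round(frames_per_second(800e9, 4096, utilization=0.5)))
```

At 800 Gbps, 64-byte frames at 100% utilization correspond to roughly 1.19 billion frames per second, which indicates the packet rates a NIC has to sustain under full load.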

Traffic Jamming

Use the Z800 Traffic Generator to create various traffic flows and the M1288 Jammer to alter or corrupt traffic in both directions. Altering fields in the headers of RoCEv2 packets simulates rewrite operations that typically take place as the packets traverse the network. The Jammer is ideal for stress-testing systems and various protocols at wire speed to identify issues missed when using simulated workloads.

Testing typical impairments in AI scenarios

The E100 Chimera Network Emulator tests the performance impact of typical AI network impairments such as varying latency, jitter, link flap and packet mis-ordering. As shown opposite, simply insert the E100 Chimera between a traffic generator (or a NIC) and the switch you want to test.

Avoid mis-ordered packets when load balancing

Our solutions for AI infrastructure

Broad range of test modules

Xena offers a selection of test modules for testing all Ethernet speeds from 10Mbps to 800Gbps.

Our range includes the latest Ethernet technology with 112G-based SerDes and PAM-4 in our Z800 Freya product line. For testing and debugging AN/LT protocols we offer our dedicated Z800 Freya Compact ANLT Test Appliance.

Easy-to-use software

Xena’s test solutions include feature-rich software for generating Ethernet traffic and analyzing the result. The primary tool is XenaManager.

There are also test suites for running standard tests such as RFC2544, RFC2889, RFC3918 and Y.1564, as well as specialized AN/LT tests. These are complemented by a comprehensive range of scripting and test automation options such as Xena OpenAutomation (XOA), an open-source test automation framework featuring a Python API that runs on any OS.

Robust chassis choices

Choose between the robust scalable 4U XenaBay with space for up to 12 test modules, or the small, easy-to-transport 1U XenaCompact with just one test module.

Exceptional value

All our solutions include the Xena Value Pack which consists of 3 years’ SW updates, 3 years’ HW warranty, free online/email support for the lifetime of the product and free product training.

Together with our low port-pricing, this represents significant savings on the TCO of your Ethernet traffic generation and analysis solutions.

Xena Solutions

… for testing AI infrastructure

White Papers

… for this industry

Is Your Network Ready For AI?
