Enhancing Data Center Networks with InfiniBand Solutions

With the rapid growth of data centers driven by large AI models, cloud computing, and big data analytics, demand for high-speed data transfer and low-latency communication keeps rising. In this complex network ecosystem, InfiniBand (IB) technology has become a market leader, playing a vital role in addressing the challenges posed by the training and deployment of large models. Constructing high-speed networks within data centers requires essential components such as high-rate network cards, optical modules, switches, and advanced network interconnect technologies.

NVIDIA Quantum™-2 InfiniBand Switch

When selecting switches, NVIDIA’s QM9700 and QM9790 series stand out as the most advanced devices. Built on NVIDIA Quantum-2 architecture, they offer 64 NDR 400Gb/s InfiniBand ports within a standard 1U chassis. This breakthrough translates to an individual switch providing a total bidirectional bandwidth of 51.2 terabits per second (Tb/s), along with an unprecedented handling capacity exceeding 66.5 billion packets per second (BPPS).
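
As a quick sanity check, the headline bandwidth figure follows directly from the port count and per-port rate:

64 ports × 400 Gb/s per port × 2 (transmit + receive) = 51,200 Gb/s = 51.2 Tb/s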

The NVIDIA Quantum-2 InfiniBand switches go beyond NDR high-speed data transfer, combining high throughput, on-chip compute processing, advanced intelligent acceleration features, flexibility, and sturdy construction. These attributes make them an ideal choice for high-performance computing (HPC), artificial intelligence, and large-scale cloud infrastructure. In addition, adopting NDR switches helps reduce overall cost and complexity, driving the continued evolution of data center network technologies.

ConnectX®-7 InfiniBand Card

The NVIDIA ConnectX®-7 InfiniBand adapter (HCA) ASIC delivers a data throughput of 400Gb/s and supports a 16-lane PCIe 5.0 or PCIe 4.0 host interface. Using advanced SerDes technology running at 100Gb/s per lane, 400Gb/s InfiniBand is carried over OSFP connectors on both the switch and the HCA. An OSFP cage on the switch carries two 400Gb/s InfiniBand ports (which can also operate as 200Gb/s NDR200 ports), while the HCA exposes a single 400Gb/s InfiniBand port. The product range includes active and passive copper cables, transceivers, and MPO fiber cables. Notably, although both ends use OSFP packaging, the modules differ in physical dimensions: the switch-side OSFP module is equipped with a finned top for heat dissipation, while the adapter-side module uses a flat top.
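
As a practical aside, the link parameters described above can be inspected from software through the standard InfiniBand verbs API. The minimal C sketch below is only an illustration; it assumes a Linux host with rdma-core (libibverbs) installed and an InfiniBand adapter such as ConnectX-7 present, and it lists the local HCAs and prints each adapter's first-port state and negotiated width/speed codes. Compile with -libverbs.

/* Minimal sketch: enumerate InfiniBand devices and print port state.
 * Assumes rdma-core (libibverbs) is installed; link with -libverbs.
 * Illustrative check only, not an official NVIDIA/FS tool. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "No RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        struct ibv_port_attr port;
        /* Port numbering starts at 1; a ConnectX-7 HCA typically exposes one port.
         * active_width/active_speed are encoded values defined by the verbs API,
         * not raw Gb/s numbers. */
        if (ibv_query_port(ctx, 1, &port) == 0) {
            printf("%s: state=%d active_width=%u active_speed=%u\n",
                   ibv_get_device_name(devs[i]),
                   port.state, port.active_width, port.active_speed);
        }
        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}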

OSFP 800G Optical Transceiver

The OSFP-800G SR8 module is designed for 800Gb/s 2xNDR InfiniBand systems, supporting transmission distances of up to 30m over OM3 or 50m over OM4 multimode fiber (MMF) at a wavelength of 850nm via dual MTP/MPO-12 connectors. Its dual-port design is a key innovation: two internal transceiver engines fully unleash the potential of the switch, allowing the 32 physical interfaces to provide up to 64 400G NDR interfaces. This high-density, high-bandwidth design enables data centers to meet the growing network demands of applications such as high-performance computing, artificial intelligence, and cloud infrastructure.
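
The port math behind the dual-port design is straightforward:

8 optical channels × 100 Gb/s per channel = 800 Gb/s per OSFP module
32 physical OSFP cages × 2 transceiver engines = 64 × 400G NDR interfaces per switch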

FS’s OSFP-800G SR8 module delivers excellent performance and reliability, providing a strong optical interconnect option for data centers. It enables data centers to harness the full performance of the QM9700/QM9790 series switches, supporting data transmission with both high bandwidth and low latency.

NDR Optical Connection Solution

To address the NDR optical connection challenge, NDR switch ports use OSFP cages with eight channels per interface, each running 100Gb/s SerDes. This allows three mainstream connection options: 800G to 800G, 800G to 2x400G, and 800G to 4x200G. Additionally, each channel can be downgraded from 100Gb/s to 50Gb/s, enabling interoperability with previous-generation HDR devices. The 400G NDR series of cables and transceivers offers diverse product choices for configuring switch and adapter systems, focusing on data center reaches of up to 500 meters to accelerate AI computing systems. The available connector types, including passive copper cables (DAC), active optical cables (AOC), and optical modules with jumpers, cater to different transmission distances and bandwidth requirements, ensuring low latency and an extremely low bit error rate for high-bandwidth AI and accelerated computing applications. For deployment details, see the FS community article InfiniBand NDR OSFP Solution.
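
For reference, these options simply reflect how the eight 100 Gb/s channels in each OSFP cage are grouped:

8 × 100 Gb/s = 1 × 800G (800G to 800G)
2 × (4 × 100 Gb/s) = 2 × 400G (800G to 2x400G)
4 × (2 × 100 Gb/s) = 4 × 200G (800G to 4x200G)

With each channel down-clocked to 50 Gb/s, the same groupings interoperate with HDR-generation devices (for example, 4 × 50 Gb/s = 200G HDR).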

Revolutionize High-Performance Computing with RDMA

To address the efficiency challenges of rapidly growing data storage and retrieval within data centers, Ethernet-converged distributed storage networks are becoming increasingly popular. However, in storage networks dominated by large flows, packet loss caused by congestion reduces transmission efficiency and further aggravates congestion. RDMA technology emerged to solve this set of problems.

What is RDMA?

RDMA (Remote Direct Memory Access) is an advanced technology designed to reduce the latency of server-side data processing during network transfers. By allowing user-level applications to read from and write to remote memory directly, RDMA bypasses the kernel and hands data straight to the network card, avoiding multiple memory copies and CPU involvement. The result is high throughput, ultra-low latency, and minimal CPU overhead. Over Ethernet, RDMA's transport protocol is RoCEv2 (RDMA over Converged Ethernet v2), a connectionless protocol based on UDP (User Datagram Protocol) that is faster and consumes fewer CPU resources than the connection-oriented TCP (Transmission Control Protocol).
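
To make the zero-copy idea concrete, the following C sketch uses the standard libibverbs API (part of rdma-core) to register an ordinary user-space buffer with the HCA. This is only a minimal illustration under the assumption that an RDMA-capable adapter is present; a complete application would also create queue pairs and exchange memory keys with the remote peer, which is omitted here.

/* Minimal sketch of RDMA memory registration with libibverbs (rdma-core).
 * Connection setup (queue pairs, key exchange) is omitted for brevity.
 * Link with -libverbs. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) {
        fprintf(stderr, "No RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);            /* protection domain */
    if (!ctx || !pd) {
        fprintf(stderr, "Failed to open device or allocate PD\n");
        return 1;
    }

    size_t len = 4096;
    void *buf = malloc(len);                          /* ordinary user-space memory */

    /* Register the buffer so the HCA can read/write it directly (zero-copy).
     * The returned lkey/rkey are what local and remote peers use in work requests. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("Registered %zu bytes: lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}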

Building a Lossless Network with RDMA

RDMA networks achieve lossless transmission by deploying PFC and ECN. PFC (Priority Flow Control) manages RDMA-specific queue traffic on the link, applying backpressure to upstream devices when congestion occurs at the switch's ingress port. ECN (Explicit Congestion Notification) provides end-to-end congestion control by marking packets when congestion occurs at the egress port, prompting the sender to reduce its transmission rate.

Optimal network performance is achieved by tuning the buffer thresholds for ECN and PFC so that ECN is triggered before PFC. This lets the network keep forwarding data at full speed while proactively reducing the servers' transmission rate to relieve congestion.
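
In other words, the ECN marking threshold should sit below the PFC XOFF threshold on the shared buffer, so that senders are asked to slow down before the link is paused. The snippet below is a toy illustration of that ordering check; the structure and values are hypothetical and do not represent any vendor's configuration syntax.

/* Toy illustration of the "ECN triggers before PFC" rule.
 * Threshold values are hypothetical, not a vendor configuration. */
#include <stdio.h>
#include <stdbool.h>

struct queue_thresholds {
    unsigned ecn_mark_kb;   /* start ECN-marking packets above this queue depth */
    unsigned pfc_xoff_kb;   /* send PFC pause frames above this queue depth */
};

static bool triggers_ecn_first(const struct queue_thresholds *t)
{
    /* ECN must kick in at a shallower queue depth than PFC,
     * so senders slow down before the link is paused. */
    return t->ecn_mark_kb < t->pfc_xoff_kb;
}

int main(void)
{
    struct queue_thresholds t = { .ecn_mark_kb = 150, .pfc_xoff_kb = 400 };
    printf("ECN triggers before PFC: %s\n", triggers_ecn_first(&t) ? "yes" : "no");
    return 0;
}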

Accelerating Cluster Performance with GPUDirect RDMA

Traditional TCP networking relies heavily on CPU processing for packet handling and often struggles to fully utilize the available bandwidth. In AI environments, therefore, RDMA has become an indispensable transport technology, particularly for large-scale cluster training. It not only delivers high-performance transfers of user-space data held in CPU memory, but also supports GPU-to-GPU transfers across the multiple servers of a GPU cluster. GPUDirect RDMA is a key component in optimizing HPC/AI performance, and NVIDIA enhances GPU cluster performance by supporting it: the network adapter can read from and write to GPU memory directly, bypassing the host CPU and system memory.
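
As a rough sketch of what GPUDirect RDMA looks like in code, the example below registers GPU memory with the HCA so the adapter can DMA to and from device memory directly. It assumes the CUDA runtime, rdma-core, and NVIDIA's nvidia-peermem kernel module are available; queue-pair setup and the actual RDMA operations are omitted.

/* Sketch of GPUDirect RDMA: register GPU memory with the HCA so the NIC
 * can DMA to/from device memory directly. Assumes the CUDA runtime,
 * rdma-core, and the nvidia-peermem kernel module are loaded.
 * Illustrative build command: gcc gdr.c -lcudart -libverbs
 * (with CUDA include/library paths added as needed). */
#include <stdio.h>
#include <cuda_runtime_api.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) { fprintf(stderr, "No RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    void *gpu_buf = NULL;
    size_t len = 1 << 20;                             /* 1 MiB of GPU memory */
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    /* With nvidia-peermem loaded, ibv_reg_mr accepts a GPU device pointer,
     * letting the adapter read/write GPU memory without staging in host RAM. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr on GPU memory");
        return 1;
    }
    printf("GPU buffer registered: rkey=0x%x\n", mr->rkey);

    ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}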

Streamlining RDMA Product Selection

In building high-performance RDMA networks, essential elements like RDMA adapters and powerful servers are necessary, but success also hinges on critical components such as high-speed optical modules, switches, and optical cables. As a leading provider of high-speed data transmission solutions, FS offers a diverse range of top-quality products, including high-performance switches, 200/400/800G optical modules, smart network cards, and more. These are precisely designed to meet the stringent requirements of low-latency and high-speed data transmission.

InfiniBand: Powering High-Performance Data Centers

Driven by the booming development of cloud computing and big data, InfiniBand has become a key technology and plays a vital role at the core of the data center. But what exactly is InfiniBand technology? What attributes contribute to its widespread adoption? The following guide will answer your questions.

What is InfiniBand?

InfiniBand is an open industry standard that defines a high-speed network for interconnecting servers, storage devices, and more. It uses point-to-point bidirectional links to enable seamless communication between processors located on different servers, and it is compatible with operating systems such as Linux, Windows, and ESXi.

InfiniBand Network Fabric

Built on a channel-based fabric, InfiniBand comprises key components including HCAs (Host Channel Adapters), TCAs (Target Channel Adapters), InfiniBand links (the connecting channels, which range from copper cables and optical fibers to on-board links), and InfiniBand switches and routers for networking. Channel adapters, in particular the HCA and TCA, are pivotal in forming InfiniBand channels, ensuring security and adherence to Quality of Service (QoS) levels for transmissions.

InfiniBand vs Ethernet

InfiniBand was developed to address data transmission bottlenecks in high-performance computing clusters. The primary differences from Ethernet lie in bandwidth, latency, network reliability, and more.

High Bandwidth and Low Latency

InfiniBand provides higher bandwidth and lower latency, meeting the performance demands of large-scale data transfer and real-time communication applications.

RDMA Support

InfiniBand supports Remote Direct Memory Access (RDMA), enabling direct data transfer between node memories. This reduces CPU overhead and improves transfer efficiency.

Scalability

InfiniBand Fabric allows for easy scalability by connecting a large number of nodes and supporting high-density server layouts. Additional InfiniBand switches and cables can expand network scale and bandwidth capacity.

High Reliability

InfiniBand Fabric incorporates redundant designs and fault isolation mechanisms, enhancing network availability and fault tolerance. Alternate paths maintain network connectivity in case of node or connection failures.

Conclusion

The InfiniBand network has gone through rapid iterations, progressing from SDR 10Gbps through DDR 20Gbps, QDR 40Gbps, FDR 56Gbps, and EDR 100Gbps to today's HDR 200Gbps and NDR 400Gbps (800Gbps per dual-port OSFP) InfiniBand. For those considering implementing InfiniBand products in their high-performance data centers, further details are available from FS.com.