The Future Network: Unlocking the Potential of Training Super-Large-Scale AI Models

From the Transformer architecture to the widespread adoption of ChatGPT in 2023, people have come to realize that increasing model parameters enhances performance, in line with the scaling law relating parameter count to performance. In particular, once the parameter scale exceeds the trillion mark, the language comprehension, logical reasoning, and problem-solving capabilities of large AI models improve rapidly.

To meet the demands of efficient distributed computing in large-scale training clusters, AI model training typically combines several parallel computing modes, such as data parallelism, pipeline parallelism, and tensor parallelism. In these parallel modes, collective communication operations among multiple computing devices become crucial. Designing an efficient cluster networking scheme is therefore key to reducing communication overhead and improving the effective computation-to-communication time ratio of the GPUs.
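To make the role of collective communication concrete, here is a minimal sketch of the gradient all-reduce that dominates data-parallel traffic. It assumes a PyTorch environment with the NCCL backend launched via torchrun; the model and tensor sizes are illustrative only.

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model):
    """Average gradients across all data-parallel ranks."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient from every GPU, then divide to get the mean.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

if __name__ == "__main__":
    # torchrun supplies RANK, WORLD_SIZE, and MASTER_ADDR via the environment.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = torch.nn.Linear(1024, 1024).cuda()
    loss = model(torch.randn(32, 1024, device="cuda")).sum()
    loss.backward()
    allreduce_gradients(model)
    dist.destroy_process_group()
```

Every such all-reduce crosses the intra- and inter-machine network, which is why the cluster fabric directly bounds training throughput.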

Challenges in Scaling GPU Networks for Efficient Training of Ultra-Large AI Models

The computing demands of artificial intelligence applications are growing exponentially, with model sizes continuously expanding and requiring enormous computational power and memory. Appropriate parallelization methods, such as data, pipeline, and tensor parallelism, have become key to improving training efficiency. Training ultra-large models requires clusters of thousands of GPUs, utilizing high-performance GPUs and RDMA protocols to achieve network throughputs of 100 to 400 Gbps. Achieving high-performance interconnection among thousands of GPUs at this scale poses several challenges in terms of network scalability:

  • Challenges encountered in large-scale RDMA networks, such as head-of-line blocking and PFC deadlocks and storms.
  • Network performance optimization, including more effective congestion control and load balancing techniques.
  • NIC connectivity issues: individual hosts are subject to hardware performance limits, raising the question of how to establish and maintain thousands of RDMA QP connections (see the sketch after this list).
  • Selection of network topology, considering whether to adopt traditional Fat Tree structures or reference high-performance computing network topologies like Torus or Dragonfly.
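As a rough illustration of the QP-scaling challenge in the third point above, the following sketch estimates connection counts under a simple full-mesh assumption between hosts; the GPUs per host and QPs per connection are assumed values, not measurements.

```python
def qp_estimate(num_gpus: int, gpus_per_host: int = 8, qps_per_connection: int = 1) -> dict:
    """Estimate RDMA queue-pair counts for a full mesh between hosts."""
    hosts = num_gpus // gpus_per_host
    peers_per_host = hosts - 1              # every host talks to every other host
    qps_per_host = peers_per_host * qps_per_connection
    return {
        "hosts": hosts,
        "qps_per_host": qps_per_host,
        "total_qps": hosts * qps_per_host,
    }

if __name__ == "__main__":
    for cluster_gpus in (1024, 4096, 16384):
        print(cluster_gpus, qp_estimate(cluster_gpus))
```

Even this simplified model shows QP counts growing quickly with cluster size, which is why NIC connection-state capacity becomes a scalability concern.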

Optimizing GPU Communication for Efficient AI Model Training Across Machines

In large-scale AI model training, GPU communication within and across machines generates a significant volume of data. With billions of model parameters, the traffic produced by the various parallelism strategies can reach hundreds of gigabytes, and completing it efficiently depends on the GPU communication bandwidth inside each machine. GPUs should therefore support high-speed interconnect protocols that reduce copies through CPU memory. In addition, PCIe bus bandwidth determines whether the network card's bandwidth can be fully utilized: with PCIe 3.0 x16 (roughly 16 GB/s, or about 128 Gbps), a 200 Gbps inter-machine link cannot be saturated, so network performance may not be fully utilized, as the quick calculation below shows.
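The PCIe example comes down to a unit conversion. The sketch below uses approximate nominal figures and ignores encoding and protocol overhead.

```python
# Approximate nominal figures; encoding and protocol overhead are ignored.
PCIE3_X16_GBYTES_PER_S = 16      # PCIe 3.0 x16 is roughly 16 GB/s
NIC_GBITS_PER_S = 200            # a 200 Gbps RDMA NIC (assumed)

pcie_gbits_per_s = PCIE3_X16_GBYTES_PER_S * 8    # 16 GB/s -> ~128 Gbps
print(f"PCIe 3.0 x16: ~{pcie_gbits_per_s} Gbps")
print(f"NIC line rate: {NIC_GBITS_PER_S} Gbps")
if pcie_gbits_per_s < NIC_GBITS_PER_S:
    print("The host bus is the bottleneck: the 200G NIC cannot be saturated.")
```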

Network Latency: A Crucial Factor in Large-Scale AI Model Training Efficiency

In data communication, network latency comprises two components: static latency and dynamic latency. Static latency covers data serialization, device forwarding, and electro-optical transmission delays; it is determined by the capability of the forwarding chips and the transmission distance, and is effectively constant once the network topology and data volume are fixed. Dynamic latency, in contrast, has a much greater impact on network performance: it includes queuing delays inside switches and the delays caused by packet loss and retransmission, typically triggered by network congestion. Beyond latency itself, network fluctuations introduce jitter, which further degrades training efficiency.
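As a rough illustration of the static terms, the sketch below estimates serialization and propagation delay for a single hop; the packet size, link rate, and cable length are assumed values chosen only to show the arithmetic.

```python
# Illustrative single-hop static-latency estimate; all inputs are assumptions.
PACKET_BYTES = 4096               # one MTU-sized packet (assumed)
LINK_GBPS = 400                   # link rate (assumed)
FIBER_METERS = 100                # cable length (assumed)
LIGHT_IN_FIBER_M_PER_S = 2.0e8    # roughly two-thirds of c in glass

serialization_us = PACKET_BYTES * 8 / (LINK_GBPS * 1e9) * 1e6
propagation_us = FIBER_METERS / LIGHT_IN_FIBER_M_PER_S * 1e6
print(f"serialization: {serialization_us:.3f} us, propagation: {propagation_us:.3f} us")
# Dynamic latency (queuing, loss, retransmission) adds on top of these terms
# and varies with congestion, which is why it dominates jitter.
```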

Network Stability: Critical for Computational Power in Large-Scale AI Model Training

Cluster computing power is crucial to AI model training speed, and the reliability of the network system forms the foundation of cluster stability. Network failures disrupt connections between computing nodes and impair the cluster's overall computing capability, while performance fluctuations reduce resource utilization. Fault-tolerant replacement or elastic expansion may be needed to handle nodes that fail during a training task. In addition, unexpected network failures can cause communication-library timeouts that severely impact efficiency. Obtaining detailed throughput, packet-loss, and other telemetry is therefore vital for fault detection.
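As one possible starting point for such telemetry, the sketch below reads a few standard RDMA port counters from Linux sysfs. It assumes an RDMA-capable NIC that exposes counters under /sys/class/infiniband; the device name is an assumption, and counter availability varies by driver.

```python
from pathlib import Path

# Counter names commonly exposed by RDMA NIC drivers in sysfs.
COUNTERS = ("port_xmit_data", "port_rcv_data", "port_xmit_wait")

def read_port_counters(device: str = "mlx5_0", port: int = 1) -> dict:
    """Read a few per-port counters for one RDMA device, if present."""
    base = Path(f"/sys/class/infiniband/{device}/ports/{port}/counters")
    values = {}
    for name in COUNTERS:
        path = base / name
        if path.exists():
            values[name] = int(path.read_text().strip())
    return values

if __name__ == "__main__":
    print(read_port_counters())
```

Sampling such counters periodically and watching for stalls or error growth is the raw material a fault-detection pipeline works from.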

The Role of Automated Deployment and Fault Detection in Large-Scale AI Clusters

The establishment of intelligent lossless networks often relies on RDMA protocols and congestion control mechanisms, accompanied by a variety of complex configurations. Any misconfiguration of these parameters can potentially impact network performance and lead to unforeseen issues. Therefore, efficient and automated deployment can effectively enhance the reliability and efficiency of large-scale model cluster systems.
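One lightweight form of such automation is a pre-flight configuration check before a training job starts. The sketch below compares a few lossless-network parameters against an intended baseline; the parameter names and the source of the live values are hypothetical placeholders, not any vendor's schema or API.

```python
# Intended baseline for a lossless RoCE fabric; keys and values are
# hypothetical placeholders, not a vendor-defined schema.
EXPECTED = {
    "pfc_enabled_priorities": "3",   # priority class assumed to carry RoCE traffic
    "ecn_enabled": "true",
    "mtu": "4200",
}

def check_config(live: dict) -> list:
    """Return (key, expected, actual) tuples for every mismatched parameter."""
    mismatches = []
    for key, expected in EXPECTED.items():
        actual = live.get(key)
        if actual != expected:
            mismatches.append((key, expected, actual))
    return mismatches

if __name__ == "__main__":
    # In practice 'live' would be collected from the switches or NICs.
    live = {"pfc_enabled_priorities": "3", "ecn_enabled": "false", "mtu": "4200"}
    for key, want, got in check_config(live):
        print(f"MISCONFIGURED {key}: expected {want}, got {got}")
```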

Similarly, given the complexity of the architecture and its configuration, timely and accurate fault localization during operation is crucial for overall business efficiency. Automated fault detection helps identify issues quickly, notify operators accurately, and reduce the cost of troubleshooting; it can swiftly pinpoint root causes and suggest corresponding remedies.

Large-scale AI models therefore place specific requirements on scale, bandwidth, stability, latency and jitter, and automation capabilities. However, current data center networks still have a technology gap when it comes to fully meeting these requirements.

AI Intelligent Computing Center Network Architecture Design Practice

Traditional cloud data center networks prioritize north-south traffic, leading to congestion, high latency, and bandwidth constraints for east-west traffic. For intelligent computing scenarios, it’s recommended to build dedicated high-performance networks to accommodate workloads, meeting high-bandwidth, low-latency, and lossless requirements.

Based on currently mature commercial switches, and considering the different InfiniBand/RoCE switch models and the GPU scales they support, the following physical network architecture specifications are recommended:

Standard: Based on InfiniBand HDR switches, a dual-layer Fat-Tree network architecture supports up to 800 GPU cards per cluster.

Large-scale: Based on 128-port 100G Ethernet switches, a RoCE dual-layer Fat-Tree network architecture supports up to 8192 GPU cards per cluster.

Extra-large: Based on InfiniBand HDR switches, a three-layer InfiniBand Fat-Tree network architecture supports up to 16,000 GPU cards per cluster.

Extra-extra-large: Based on InfiniBand Quantum-2 switches or equivalent Ethernet data center switches, a three-layer Fat-Tree network architecture supports up to 100,000 GPU cards per cluster.
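The GPU counts quoted above are consistent with standard non-blocking fat-tree port arithmetic: with radix-k switches, a two-tier fat-tree connects up to k²/2 end ports and a three-tier fat-tree up to k³/4. The sketch below reproduces that arithmetic; real deployments reserve ports for management and uplinks, so practical limits are somewhat lower.

```python
def fat_tree_capacity(radix: int, tiers: int) -> int:
    """Maximum end ports of a non-blocking fat-tree built from radix-port switches."""
    if tiers == 2:
        return radix ** 2 // 2
    if tiers == 3:
        return radix ** 3 // 4
    raise ValueError("only two- or three-tier fat-trees are considered here")

print(fat_tree_capacity(40, 2))    # 800    -> 40-port HDR switches, two tiers
print(fat_tree_capacity(128, 2))   # 8192   -> 128-port 100G switches, two tiers
print(fat_tree_capacity(40, 3))    # 16000  -> 40-port HDR switches, three tiers
```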

In addition, high-speed network connections are crucial for ensuring efficient data transmission and processing.

How FS Can Help

FS provides high-quality connectivity products to meet the demands of AI model network deployment. The FS portfolio includes 200G and 400G InfiniBand switches, data center switches, 10G/40G/100G/400G network cards, and 10/25G, 40G, 50/56G, and 100G optical modules, accelerating AI model training and inference. The optical modules offer high bandwidth, low latency, and low error rates, enhancing data center network capabilities for faster and more efficient AI computing. For more information, please visit the FS website.