Mastering the Basics of GPU Computing

Large models are typically trained on clusters of machines, preferably with many GPUs per server. This article introduces the terminology and common network architectures of GPU computing.

Exploring Key Components in GPU Computing

PCIe Switch Chip

In high-performance GPU computing, key components such as CPUs, memory modules, NVMe storage, GPUs, and network cards are connected through the PCIe (Peripheral Component Interconnect Express) bus or through dedicated PCIe switch chips.


NVLink

NVLink is a wire-based, serial, multi-lane, short-range communications link developed by Nvidia. Unlike PCI Express, a device can have multiple NVLinks, and devices communicate over a mesh network instead of through a central hub. The protocol was first announced in March 2014 and uses a proprietary high-speed signaling interconnect (NVHS).

The technology supports full mesh interconnection between GPUs on the same node. Each generation, from NVLink 1.0 through NVLink 2.0 and 3.0 to NVLink 4.0, has increased the bidirectional bandwidth available to each GPU, improving the performance of GPU computing applications.
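To make the generational growth concrete, the sketch below multiplies the number of links per GPU by the bidirectional bandwidth per link. The figures in the table are the commonly published values for each generation's flagship data-center GPU and are not taken from this article, so treat them as assumptions; the names `NVLINK_GENERATIONS` and `total_bidirectional_gb_per_s` are hypothetical.

```python
# Assumed, commonly published figures per generation:
# (NVLink links per GPU, bidirectional GB/s per link).
NVLINK_GENERATIONS = {
    "NVLink 1.0": (4, 40),
    "NVLink 2.0": (6, 50),
    "NVLink 3.0": (12, 50),
    "NVLink 4.0": (18, 50),
}


def total_bidirectional_gb_per_s(links: int, gb_per_s_per_link: int) -> int:
    """Aggregate bidirectional bandwidth of one GPU across all of its links."""
    return links * gb_per_s_per_link


aggregate = {
    name: total_bidirectional_gb_per_s(links, per_link)
    for name, (links, per_link) in NVLINK_GENERATIONS.items()
}

for name, gb_per_s in aggregate.items():
    print(f"{name}: {gb_per_s} GB/s per GPU")
```

Under these assumptions, most of the growth comes from adding more links per GPU rather than from faster individual links.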


NVSwitch

NVSwitch is a switching chip developed by NVIDIA, designed specifically for high-performance computing and artificial intelligence applications. Its primary function is to provide high-speed, low-latency communication between multiple GPUs within the same host.

NVLink Switch

Unlike the NVSwitch, which is integrated into GPU modules within a single host, the NVLink Switch serves as a standalone switch specifically engineered for linking GPUs in a distributed computing environment.


HBM

Several GPU manufacturers address the memory bandwidth bottleneck by stacking multiple DRAM dies to form so-called high-bandwidth memory (HBM) and integrating it with the GPU. With this design, a GPU does not need to traverse the PCIe switch chip to reach its dedicated memory. As a result, data transfer speeds increase significantly, potentially by an order of magnitude.

Bandwidth Unit

In large-scale GPU training, performance is directly tied to data transfer speeds across pathways such as PCIe, memory, NVLink, HBM, and the network. These data rates are quoted in different units: network bandwidth is usually given in bits per second (b/s), while PCIe, memory, NVLink, and HBM bandwidth is usually given in bytes per second (B/s).
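Because network links are conventionally rated in bits per second and most other pathways in bytes per second, comparing them requires dividing or multiplying by 8. The following is a minimal sketch of that conversion; the function names, the 100 Gb/s link speed, and the 50 GB payload are illustrative assumptions, and protocol overhead is ignored.

```python
BITS_PER_BYTE = 8


def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a rate in gigabits per second (Gb/s) to gigabytes per second (GB/s)."""
    return gbit_per_s / BITS_PER_BYTE


def transfer_seconds(size_gbyte: float, link_gbit_per_s: float) -> float:
    """Ideal time to move size_gbyte over a link rated in Gb/s (no overhead)."""
    return size_gbyte / gbit_to_gbyte_per_s(link_gbit_per_s)


# Hypothetical example: a 100 Gb/s network card moving 50 GB of data.
nic_gbyte_per_s = gbit_to_gbyte_per_s(100)
ideal_seconds = transfer_seconds(50, 100)
print(f"100 Gb/s = {nic_gbyte_per_s} GB/s")
print(f"50 GB takes at least {ideal_seconds} s")
```

Converting everything to one unit before comparing pathways avoids overestimating network throughput by a factor of eight.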

Storage Network Card

In a GPU server, the storage network card connects to the CPU via PCIe and enables communication with distributed storage systems. It plays a crucial role in reading and writing data efficiently during deep learning model training. The storage network card also handles node management tasks, including SSH (Secure Shell) remote login, system performance monitoring, and the collection of related data. These tasks help monitor and maintain the running status of the GPU cluster.

For a more in-depth exploration of these terms, see the article Unveiling the Foundations of GPU Computing-1 from the FS community.

High-Performance GPU Fabric

NVSwitch Fabric

In a full mesh network topology, each node is connected directly to all the other nodes. Usually, 8 GPUs are connected in a full-mesh configuration through six NVSwitch chips, also referred to as NVSwitch fabric.

This fabric gives every pair of GPUs a high-bandwidth, bidirectional path, which supports efficient communication for parallel computing tasks. The bandwidth per link depends on the NVLink generation in use, such as NVLink 3.0, and it contributes to overall performance in large-scale GPU clusters.
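The full-mesh idea described above can be illustrated by counting connections. The sketch below enumerates the GPU pairs that a direct full mesh of 8 GPUs would have to link, and compares that with the number of GPU-to-switch attachments when six switch chips sit in the middle. It assumes each GPU attaches at least once to every switch chip, which this article does not state; the function names are hypothetical.

```python
from itertools import combinations


def full_mesh_pairs(num_gpus: int) -> list[tuple[int, int]]:
    """Every unordered pair of GPUs that a direct full mesh must connect."""
    return list(combinations(range(num_gpus), 2))


def switch_attachments(num_gpus: int, num_switches: int) -> int:
    """Minimum GPU-to-switch attachments, assuming each GPU reaches every switch."""
    return num_gpus * num_switches


pairs = full_mesh_pairs(8)
print(f"GPU pairs in a full mesh of 8 GPUs: {len(pairs)}")
print(f"GPU-to-switch attachments with 6 switch chips: {switch_attachments(8, 6)}")
```

With switch chips in the middle, any GPU pair can communicate without a dedicated point-to-point cable between the two GPUs.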

IDC GPU Fabric

The fabric consists mainly of a computing network and a storage network. The computing network connects GPU nodes and supports collaboration on parallel computing tasks. This involves transferring data between multiple GPUs, sharing calculation results, and coordinating the execution of massively parallel computing tasks. The storage network connects GPU nodes to storage systems and supports large-scale data read and write operations. This includes loading data from the storage system into GPU memory and writing calculation results back to the storage system.

Want to know more about GPU fabric? See the article Unveiling the Foundations of GPU Computing-2 from the FS community.