Demystifying NCCL(2507) - An In-depth Analysis of GPU Communication Protocols and Algorithms
Data-Transfer Methods and Transport Layer
| | Intra-Node | Inter-Node |
|---|---|---|
| Transport | P2P (p2p.cc), SHM (shm.cc), NVLS (nvls.cc) | NET (net_ib.cc, net_socket.cc), COLLNET (coll_net.cc) |
| Physical Interconnect | NVLink, PCIe | InfiniBand, RoCE, TCP/IP (Socket) |
| Optimizations | GPUDirect P2P, P2P_DIRECT | GPUDirect RDMA |
The intra-node/inter-node split is not strict: RDMA can be used for intra-node communication, and NVLink can be used for inter-node communication.
Intra-node Data Transfer

Figure 1: Illustration of intra-node data transfer paths in NCCL.
Inter-node Data Transfer

Figure 2: Illustration of inter-node data transfer paths in NCCL.
NCCL Collective Algorithms
Qualitative Algorithm Analysis
Non-pipelined Pattern

Figure 4: Illustration of the Ring AllReduce algorithm in NCCL.
A Ring AllReduce across k GPUs consists of 2k-1 steps (indexed 0 to 2k-2), and each step maps to one NCCL primitive:
| Step Index | NCCL Primitive |
|---|---|
| 0 | send |
| 1 to k-2 | recvReduceSend |
| k-1 | recvReduceCopySend |
| k to 2k-3 | recvCopySend |
| 2k-2 | recv |
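The step table above can be checked with a small simulation. The following is a minimal sketch in plain Python, not NCCL code: `ring_primitive` and `ring_allreduce` are hypothetical helper names, each chunk is a single number, the reduction is a sum, and the chunk index `(rank - step) % k` is an assumption chosen for illustration rather than NCCL's actual chunk ordering.

```python
def ring_primitive(step: int, k: int) -> str:
    """Map a step index to the primitive name from the table (requires k >= 2)."""
    if k < 2 or not 0 <= step <= 2 * k - 2:
        raise ValueError("need k >= 2 and 0 <= step <= 2k-2")
    if step == 0:
        return "send"
    if step <= k - 2:
        return "recvReduceSend"
    if step == k - 1:
        return "recvReduceCopySend"
    if step <= 2 * k - 3:
        return "recvCopySend"
    return "recv"


def ring_allreduce(inputs):
    """Simulate a sum AllReduce over k ranks; inputs[r] holds k chunks (numbers)."""
    k = len(inputs)
    outputs = [[None] * k for _ in range(k)]
    inbox = [None] * k  # value in flight toward each rank
    for step in range(2 * k - 1):
        prim = ring_primitive(step, k)
        outbox = [None] * k
        for r in range(k):
            c = (r - step) % k  # illustrative chunk index (assumption)
            if prim == "send":
                val = inputs[r][c]
            else:
                val = inbox[r]  # receive from the previous rank
                if "Reduce" in prim:
                    val = val + inputs[r][c]
                if "Copy" in prim or prim == "recv":
                    outputs[r][c] = val  # final value lands in the output buffer
            if prim.lower().endswith("send"):
                outbox[(r + 1) % k] = val  # forward to the next rank in the ring
        inbox = outbox
    return outputs


if __name__ == "__main__":
    data = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print([ring_primitive(s, 3) for s in range(5)])
    print(ring_allreduce(data))
```

Steps 0 through k-1 form the reduce-scatter half (each rank ends up owning one fully reduced chunk), and steps k through 2k-2 form the all-gather half that circulates the reduced chunks.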
Pipelined Pattern

Figure 5: Illustration of the Tree AllReduce algorithm in NCCL.
| GPU Role | Primitives |
|---|---|
| Root | recvReduceCopySend |
| Middle | recvReduceSend and then recvCopySend |
| Leaf | send and then recv |
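The role table above can be illustrated with a toy tree. This is a minimal sketch under stated assumptions, not NCCL's implementation: `tree_allreduce` is a hypothetical helper, the tree is given as a `parent` list, each rank holds a single number, the reduction is a sum, and the chunk-level pipelining that gives this pattern its name is left out so that only the per-role primitive sequence is visible.

```python
def tree_allreduce(values, parent):
    """Simulate a sum AllReduce on a tree; parent[r] is None for the root.

    Returns (outputs, trace), where trace[r] lists the primitives rank r ran.
    """
    n = len(values)
    children = [[] for _ in range(n)]
    root = None
    for r, p in enumerate(parent):
        if p is None:
            root = r
        else:
            children[p].append(r)
    trace = {r: [] for r in range(n)}

    def reduce_up(r):
        # Reduce phase: partial sums flow from the leaves toward the root.
        if not children[r]:
            trace[r].append("send")
            return values[r]
        acc = values[r] + sum(reduce_up(c) for c in children[r])
        is_root = parent[r] is None
        trace[r].append("recvReduceCopySend" if is_root else "recvReduceSend")
        return acc

    total = reduce_up(root)
    outputs = [None] * n
    outputs[root] = total

    def broadcast_down(r):
        # Broadcast phase: the reduced result flows from the root to the leaves.
        for c in children[r]:
            outputs[c] = total
            trace[c].append("recvCopySend" if children[c] else "recv")
            broadcast_down(c)

    broadcast_down(root)
    return outputs, trace


if __name__ == "__main__":
    # Rank 0 is the root, rank 1 a middle node, ranks 2-4 leaves.
    out, trace = tree_allreduce([1, 2, 3, 4, 5], [None, 0, 0, 1, 1])
    print(out)
    print(trace)
```

In the sketch the root runs a single primitive because it both finishes the reduction and starts the broadcast, while middle and leaf ranks run one primitive per phase, matching the table.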