Demystifying NCCL(2507) - An In-depth Analysis of GPU Communication Protocols and Algorithms
Data-Transfer Methods and Transport Layer
Category | Intra-Node | Inter-Node |
---|---|---|
Transport | P2P (p2p.cc), SHM (shm.cc), NVLS (nvls.cc) | NET (net_ib.cc, net_socket.cc), COLLNET (coll_net.cc) |
Physical Interconnect | NVLink, PCIe | InfiniBand, RoCE, TCP/IP (Socket) |
Optimizations | GPUDirect P2P, P2P_DIRECT | GPUDirect RDMA |
The split in the table above is not strict: RDMA can be used for intra-node communication, and NVLink can be used for inter-node communication.
Intra-node Data Transfer

Figure 1: Illustration of intra-node data transfer paths in NCCL.
Inter-node Data Transfer

Figure 2: Illustration of inter-node data transfer paths in NCCL.
NCCL Collective Algorithms
Qualitative Algorithm Analysis
Non-pipelined Pattern

Figure 4: Illustration of the Ring AllReduce algorithm in NCCL.
A Ring AllReduce across k GPUs consists of 2k-1 steps. Every GPU executes the same primitive at a given step index:
Step Index | NCCL Primitive |
---|---|
0 | send |
1 to k-2 | recvReduceSend |
k-1 | recvReduceCopySend |
k to 2k-3 | recvCopySend |
2k-2 | recv |
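The step table above can be checked with a small simulation. The following is a minimal sketch, not NCCL code: it assumes one number per chunk, lock-step execution across ranks, and a chunk index of `(rank - step) mod k`, which is an illustrative choice rather than NCCL's actual indexing. The names `ring_primitive` and `simulate_ring_allreduce` are hypothetical.

```python
def ring_primitive(step: int, k: int) -> str:
    """Return the primitive from the step table for a ring of k GPUs."""
    if k < 2:
        raise ValueError("a ring needs at least 2 GPUs")
    if not 0 <= step <= 2 * k - 2:
        raise ValueError("step must be in [0, 2k-2]")
    if step == 0:
        return "send"
    if step <= k - 2:
        return "recvReduceSend"
    if step == k - 1:
        return "recvReduceCopySend"
    if step <= 2 * k - 3:
        return "recvCopySend"
    return "recv"


def simulate_ring_allreduce(inputs):
    """Sum-AllReduce over a ring. inputs[r][c] is rank r's value for chunk c."""
    k = len(inputs)
    outputs = [[None] * k for _ in range(k)]
    in_flight = [None] * k  # in_flight[r]: value rank r sent in the previous step
    for step in range(2 * k - 1):
        prim = ring_primitive(step, k)
        sent = [None] * k
        for r in range(k):
            c = (r - step) % k  # chunk handled by rank r at this step (assumption)
            if prim == "send":
                value = inputs[r][c]
            else:
                value = in_flight[(r - 1) % k]  # receive from the previous rank
                if "Reduce" in prim:
                    value += inputs[r][c]  # combine with the local contribution
                if "Copy" in prim or prim == "recv":
                    outputs[r][c] = value  # fully reduced chunk lands in the output
            if prim == "send" or prim.endswith("Send"):
                sent[r] = value  # forward to the next rank
        in_flight = sent
    return outputs


if __name__ == "__main__":
    k = 3
    print([ring_primitive(s, k) for s in range(2 * k - 1)])
    print(simulate_ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
    # every rank ends with [111, 222, 333]
```

In this model each rank writes exactly one chunk to its output at step k-1, k-2 chunks during the recvCopySend steps, and the last chunk at the final recv, which accounts for all k chunks.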
Pipelined Pattern

Figure 5: Illustration of the Tree AllReduce algorithm in NCCL.
GPU Role | Primitives |
---|---|
Root | recvReduceCopySend |
Middle | recvReduceSend and then recvCopySend |
Leaf | send and then recv |
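The role table above can be illustrated with a similar sketch. It assumes a single tree described by a hypothetical `parent` list (`None` marks the root), a single scalar per GPU, and no chunking or pipelining, so it shows only which role each rank plays and that the reduce-up and broadcast-down phases yield the same sum everywhere. `TREE_PRIMITIVES` and `simulate_tree_allreduce` are illustrative names, not NCCL APIs.

```python
# Primitive sequence per role, copied from the table above.
TREE_PRIMITIVES = {
    "root": ["recvReduceCopySend"],
    "middle": ["recvReduceSend", "recvCopySend"],
    "leaf": ["send", "recv"],
}


def simulate_tree_allreduce(values, parent):
    """Sum-AllReduce over a tree. parent[r] is rank r's parent, None for the root."""
    n = len(values)
    children = [[] for _ in range(n)]
    root = None
    for rank, p in enumerate(parent):
        if p is None:
            if root is not None:
                raise ValueError("exactly one root is expected")
            root = rank
        else:
            children[p].append(rank)
    if root is None:
        raise ValueError("exactly one root is expected")

    def role(rank):
        if rank == root:
            return "root"
        return "middle" if children[rank] else "leaf"

    def reduce_up(rank):
        # Leaves send their value; inner ranks receive, reduce, and pass it on.
        return values[rank] + sum(reduce_up(c) for c in children[rank])

    outputs = [None] * n

    def broadcast_down(rank, value):
        # Each rank copies the result and forwards it to its children.
        outputs[rank] = value
        for c in children[rank]:
            broadcast_down(c, value)

    broadcast_down(root, reduce_up(root))
    return outputs, [role(r) for r in range(n)]


if __name__ == "__main__":
    outputs, roles = simulate_tree_allreduce([1, 2, 3, 4], [None, 0, 0, 1])
    for rank, r in enumerate(roles):
        print(rank, r, TREE_PRIMITIVES[r], outputs[rank])
```

The recursion stands in for the actual message exchange. In a real run the two phases overlap across chunks, which is what makes this pattern pipelined.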