Demystifying NCCL(2507) - An In-depth Analysis of GPU Communication Protocols and Algorithms

Data-Transfer Methods and Transport Layer

| | Intra-Node | Inter-Node |
| --- | --- | --- |
| Transport | P2P (p2p.cc), SHM (shm.cc), NVLS (nvls.cc) | NET (net_ib.cc, net_socket.cc), COLLNET (coll_net.cc) |
| Physical Interconnect | NVLink, PCIe | InfiniBand, RoCE, TCP/IP (Socket) |
| Optimizations | GPUDirect P2P, P2P_DIRECT | GPUDirect RDMA |

Note that RDMA can also be used for intra-node communication, and NVLink can also be used for inter-node communication.

Intra-node Data Transfer

Figure 1: Illustration of intra-node data transfer paths in NCCL.

Inter-node Data Transfer

Figure 2: Illustration of inter-node data transfer paths in NCCL.

NCCL Collective Algorithms

Qualitative Algorithm Analysis

Non-pipelined Pattern

Figure 4: Illustration of the Ring AllReduce algorithm in NCCL.

A Ring AllReduce across k GPUs consists of 2k-1 steps (step indices 0 through 2k-2).

| Step Index | NCCL Primitive |
| --- | --- |
| 0 | send |
| 1 to k-2 | recvReduceSend |
| k-1 | recvReduceCopySend |
| k to 2k-3 | recvCopySend |
| 2k-2 | recv |
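
The step schedule above can be checked with a small CPU-side simulation. The following is a minimal sketch, not NCCL code: `primitive_for_step` and `ring_allreduce` are hypothetical helper names, each GPU's buffer is assumed to be split into exactly k scalar chunks, and rank r is assumed to work on chunk (r - step) mod k at each step. Only the primitive names and the step ranges come from the table.

```python
def primitive_for_step(step: int, k: int) -> str:
    """Return the primitive the table assigns to a step index (requires k >= 2)."""
    if step == 0:
        return "send"
    if step <= k - 2:
        return "recvReduceSend"
    if step == k - 1:
        return "recvReduceCopySend"
    if step <= 2 * k - 3:
        return "recvCopySend"
    if step == 2 * k - 2:
        return "recv"
    raise ValueError(f"step {step} is outside 0..{2 * k - 2}")


def ring_allreduce(inputs):
    """Simulate one Ring AllReduce pass over k ranks with k chunks each.

    inputs[r][c] is rank r's value for chunk c (assumed layout).
    Returns (outputs, schedule), where schedule lists the primitive per step.
    """
    k = len(inputs)
    if k < 2 or any(len(buf) != k for buf in inputs):
        raise ValueError("need k >= 2 ranks, each holding exactly k chunks")

    outputs = [[None] * k for _ in range(k)]
    inbox = [None] * k  # inbox[r]: value sent to rank r during the previous step
    schedule = []

    for step in range(2 * k - 1):
        prim = primitive_for_step(step, k)
        schedule.append(prim)
        outbox = [None] * k
        for r in range(k):
            c = (r - step) % k  # chunk this rank handles at this step (assumption)
            if prim == "send":
                value = inputs[r][c]
            else:
                value = inbox[r]  # recv from the previous rank in the ring
                if "Reduce" in prim:
                    value = value + inputs[r][c]
                if prim != "recvReduceSend":
                    outputs[r][c] = value  # copy the result to the output buffer
            if prim.lower().endswith("send"):
                outbox[(r + 1) % k] = value  # send to the next rank in the ring
        inbox = outbox

    return outputs, schedule


if __name__ == "__main__":
    k = 4
    data = [[r + 10 * c for c in range(k)] for r in range(k)]
    result, steps = ring_allreduce(data)
    print(len(steps), steps)
    print(result[0])
```

In this model every rank ends up with the element-wise sum of all inputs after exactly 2k-1 steps. The first k steps (0 to k-1) perform the reduction and the remaining steps forward the finished chunks around the ring.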

Pipelined Pattern

Figure 5: Illustration of the Tree AllReduce algorithm in NCCL.

| GPU Role | Primitives |
| --- | --- |
| Root | recvReduceCopySend |
| Middle | recvReduceSend and then recvCopySend |
| Leaf | send and then recv |
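
The role table above can be illustrated with a toy simulation of a single chunk moving up and then down a tree. This is a rough sketch under stated assumptions, not NCCL's implementation: `tree_allreduce` and the `parent` list encoding are hypothetical, each rank holds one scalar, and the pipelining of many chunks through the tree is deliberately left out. Only the role-to-primitive mapping comes from the table.

```python
def tree_allreduce(values, parent):
    """Simulate Tree AllReduce for one chunk.

    values[r] is rank r's input; parent[r] is its parent rank, or None for the root.
    Returns (outputs, trace), where trace lists (rank, primitive) in execution order.
    """
    n = len(values)
    children = [[] for _ in range(n)]
    roots = [r for r, p in enumerate(parent) if p is None]
    if len(roots) != 1:
        raise ValueError("exactly one root is required")
    root = roots[0]
    for r, p in enumerate(parent):
        if p is not None:
            children[p].append(r)
    if not children[root]:
        raise ValueError("the root needs at least one child")

    outputs = [None] * n
    trace = []

    def reduce_up(r):
        """Reduce phase: partial sums flow from the leaves toward the root."""
        if not children[r]:
            trace.append((r, "send"))  # leaf: send its own value to the parent
            return values[r]
        acc = values[r]
        for c in children[r]:
            acc = acc + reduce_up(c)  # recv from each child and reduce
        if r == root:
            outputs[r] = acc  # root: keep the result and send it back down
            trace.append((r, "recvReduceCopySend"))
        else:
            trace.append((r, "recvReduceSend"))  # middle: forward the partial sum up
        return acc

    def broadcast_down(r, total):
        """Broadcast phase: the final result flows from the root to the leaves."""
        for c in children[r]:
            trace.append((c, "recvCopySend" if children[c] else "recv"))
            outputs[c] = total
            broadcast_down(c, total)

    total = reduce_up(root)
    broadcast_down(root, total)
    return outputs, trace


if __name__ == "__main__":
    # Rank 0 is the root, rank 1 is a middle node, ranks 2, 3, 4 are leaves.
    outs, trace = tree_allreduce([1, 2, 3, 4, 5], [None, 0, 0, 1, 1])
    print(outs)
    for rank, prim in trace:
        print(rank, prim)
```

Each rank's entries in the trace match its row in the table: the root issues one primitive, while middle and leaf ranks issue one primitive per phase.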