
DeepSeek-V3(2412)


DeepSeek-V3 Transformer Block

DeepSeek-V3 is a decoder-only Transformer LLM built by stacking Transformer blocks that use Multi-Head Latent Attention as the masked self-attention and DeepSeekMoE as the feed-forward network.
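For orientation, here is a minimal shape-level sketch of such a block: a pre-norm residual layout with pluggable attention and FFN sub-layers. This is not the official implementation; `DecoderBlock`, `attn`, and `ffn` are illustrative stand-ins for the MLA and DeepSeekMoE modules sketched in the sections below.

```python
# A minimal, shape-level sketch of a pre-norm decoder block (assumed layout).
# `attn` and `ffn` stand in for Multi-Head Latent Attention and DeepSeekMoE.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn_norm = nn.RMSNorm(d_model)  # pre-norm before attention (PyTorch >= 2.4)
        self.ffn_norm = nn.RMSNorm(d_model)   # pre-norm before the FFN/MoE sub-layer
        self.attn = attn                      # e.g. Multi-Head Latent Attention
        self.ffn = ffn                        # e.g. DeepSeekMoE

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); residual connections around both sub-layers
        x = x + self.attn(self.attn_norm(x))
        x = x + self.ffn(self.ffn_norm(x))
        return x
```

Stacking these blocks and adding an embedding layer, a final norm, and an output head yields the decoder-only model; causal masking is handled inside the attention sub-layer.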

Rotary Positional Embedding(RoPE)
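Since the MLA equations below apply RoPE only to the decoupled $Q^R$/$K^R$ components, here is a minimal sketch of rotary embedding itself, assuming the interleaved-pair convention with base 10000; DeepSeek-V3's exact long-context frequency scaling is not reproduced here.

```python
# A minimal RoPE sketch: rotate consecutive (even, odd) feature pairs of q/k
# by position-dependent angles. Base 10000 is an assumed common default.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (..., seq_len, dim) with even dim; returns the rotated tensor
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq  # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]        # split features into (even, odd) pairs
    return torch.stack((x1 * cos - x2 * sin,   # 2-D rotation of each pair
                        x1 * sin + x2 * cos), dim=-1).flatten(-2)
```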

Multi-Head Latent Attention(MLA)


Figure 2. Multi-Head Latent Attention(MLA)

MLA introduces latent vectors so that, compared to standard MHA, it improves compute efficiency, memory usage, and KV-cache size; a shape-level sketch follows the definitions below.

  • Latent
    • $C^{KV} = X W^{DKV} \in \mathbb{R}^{n \times d_c} ~ (d_c \ll d_h n_h)$
    • $C^Q = X W^{DQ} \in \mathbb{R}^{n \times d_c'} ~ (d_c' \ll d_h n_h)$
  • Query
    • $Q_i^C = C^Q W_i^{UQ} \in \mathbb{R}^{n \times d_h n_h}$
    • $Q_i^R = \text{RoPE}(C^Q W_i^{QR}) \in \mathbb{R}^{n \times d_h^R n_h}$
    • $Q_i = \text{Concat}(Q_i^C, Q_i^R) \in \mathbb{R}^{n \times (d_h + d_h^R) n_h}$
  • Key
    • $K^R = K_i^R = \text{RoPE}(X W^{KR}) \in \mathbb{R}^{n \times d_h^R}$ (a single RoPE key, shared by and broadcast to all $n_h$ heads)
    • $K_i^C = C^{KV} W_i^{UK} \in \mathbb{R}^{n \times d_h n_h}$
    • $K_i = \text{Concat}(K_i^C, K_i^R) \in \mathbb{R}^{n \times (d_h + d_h^R) n_h}$
  • Value
    • $V_i^C = C^{KV} W_i^{UV} \in \mathbb{R}^{n \times d_h n_h}$
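Below is a minimal, shape-level sketch of these projections. The sizes (`d_model`, `n_h`, `d_h`, `d_h_rope`, `d_c`, `d_c_q`) are assumed example values, `rope` defaults to a pass-through placeholder for the rotary embedding sketched above, and per-head splitting of the concatenated tensors is omitted; it mirrors the equations above, not DeepSeek's actual implementation.

```python
# A shape-level sketch of the MLA projections (illustrative, not the official code).
import torch
import torch.nn as nn

class MLAProjections(nn.Module):
    def __init__(self, d_model=512, n_h=8, d_h=64, d_h_rope=32, d_c=128, d_c_q=96):
        super().__init__()
        self.n_h = n_h
        self.w_dkv = nn.Linear(d_model, d_c, bias=False)           # W^{DKV}: X -> C^{KV}
        self.w_dq  = nn.Linear(d_model, d_c_q, bias=False)         # W^{DQ}:  X -> C^{Q}
        self.w_uq  = nn.Linear(d_c_q, d_h * n_h, bias=False)       # W^{UQ}
        self.w_qr  = nn.Linear(d_c_q, d_h_rope * n_h, bias=False)  # W^{QR}
        self.w_uk  = nn.Linear(d_c, d_h * n_h, bias=False)         # W^{UK}
        self.w_kr  = nn.Linear(d_model, d_h_rope, bias=False)      # W^{KR}, shared across heads
        self.w_uv  = nn.Linear(d_c, d_h * n_h, bias=False)         # W^{UV}

    def forward(self, x, rope=lambda t: t):
        # x: (n, d_model); `rope` is a stand-in for the rotary embedding above
        c_kv = self.w_dkv(x)                                 # (n, d_c)
        c_q  = self.w_dq(x)                                  # (n, d_c')
        q = torch.cat([self.w_uq(c_q),                       # Q^C
                       rope(self.w_qr(c_q))], dim=-1)        # Q^R -> (n, (d_h + d_h^R) * n_h)
        k_rope = rope(self.w_kr(x))                          # (n, d_h^R), shared by all heads
        k = torch.cat([self.w_uk(c_kv),                      # K^C
                       k_rope.repeat(1, self.n_h)], dim=-1)  # broadcast K^R to every head
        v = self.w_uv(c_kv)                                  # (n, d_h * n_h)
        return q, k, v
```

At inference time only $C^{KV}$ and the small shared RoPE key need to be cached, since $K^C$ and $V^C$ can be re-derived from the latent; this is where the KV-cache reduction comes from.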

Mixture of Experts(MoE)


Figure 2. DeepSeekMoE
  • $s_{it} = \text{Sigmoid}(x_t^T e_i)$
  • $g_{it}' = \begin{cases} s_{it} & , s_{it} \in \text{Topk}(\{s_{jt} \mid 1 \leqslant j \leqslant N_r\}, K_r) \\ 0 & , \text{otherwise} \end{cases}$
  • $g_{it} = \dfrac{g_{it}'}{\sum_{j=1}^{N_r} g_{jt}'}$
  • $\text{MoE}(x_t) = x_t + \sum_{i=1}^{N_s} \text{FFN}_i^s(x_t) + \sum_{i=1}^{N_r} g_{it} \, \text{FFN}_i^r(x_t)$
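A minimal sketch of this gating (illustrative sizes, toy stand-in expert MLPs, not the official implementation): sigmoid affinity scores, Top-$K_r$ selection, normalization over the selected experts, and a residual sum of shared plus gated routed experts.

```python
# A minimal sketch of DeepSeekMoE-style gating as defined above (illustrative only).
import torch
import torch.nn as nn

class DeepSeekMoESketch(nn.Module):
    def __init__(self, d_model=64, n_shared=1, n_routed=8, k_routed=2):
        super().__init__()
        self.k = k_routed
        self.centroids = nn.Parameter(torch.randn(n_routed, d_model))  # expert centroids e_i
        make_ffn = lambda: nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                         nn.SiLU(),
                                         nn.Linear(4 * d_model, d_model))
        self.shared = nn.ModuleList(make_ffn() for _ in range(n_shared))  # FFN^s_i
        self.routed = nn.ModuleList(make_ffn() for _ in range(n_routed))  # FFN^r_i

    def forward(self, x):
        # x: (n_tokens, d_model)
        s = torch.sigmoid(x @ self.centroids.T)             # s_{it}: token-expert affinities
        topk_val, topk_idx = s.topk(self.k, dim=-1)         # keep only the Top-K_r scores
        g_prime = torch.zeros_like(s).scatter(-1, topk_idx, topk_val)   # g'_{it}
        g = g_prime / g_prime.sum(dim=-1, keepdim=True)     # g_{it}: normalize over selected
        out = x.clone()                                     # residual term x_t
        for ffn in self.shared:                             # always-active shared experts
            out = out + ffn(x)
        for i, ffn in enumerate(self.routed):               # routed experts, weighted by g_{it}
            out = out + g[:, i:i + 1] * ffn(x)
        return out
```

In practice each token is dispatched only to its selected experts (e.g. with expert parallelism as in the GShard illustration below); the dense loop over all routed experts here is only for readability.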


Figure 3. Illustration of scaling of Transformer Encoder with MoE Layers (GShard)