Transformer Architecture
https://arxiv.org/pdf/1706.03762.pdf
https://charon.me/posts/pytorch/pytorch_seq2seq_6/
Self-Attention
- Input: $X = \{x_1, x_2, \cdots, x_n\} \in \mathbb{R}^{n \times d}$
- Query: $q_i = W_q x_i \in \mathbb{R}^{d_k}$
- Key: $k_i = W_k x_i \in \mathbb{R}^{d_k}$
- Value: $v_i = W_v x_i \in \mathbb{R}^{d_v}$
- Attention: $a_{ij} = \mathrm{softmax}\!\left(\frac{q_i^T k_j}{\sqrt{d_k}}\right) = \frac{\exp\!\left(q_i^T k_j / \sqrt{d_k}\right)}{\sum_{j=1}^{n} \exp\!\left(q_i^T k_j / \sqrt{d_k}\right)}$
- Output: $y_i = \sum_{j=1}^{n} a_{ij} v_j$ (see the sketch after this list)
- Positional Encoding: $p_i = \left\{\sin\!\left(\frac{i}{10000^{2 \cdot 1 / d}}\right), \cos\!\left(\frac{i}{10000^{2 \cdot 1 / d}}\right), \sin\!\left(\frac{i}{10000^{2 \cdot 2 / d}}\right), \cos\!\left(\frac{i}{10000^{2 \cdot 2 / d}}\right), \cdots\right\} \in \mathbb{R}^{d}$ (sketch below)
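A minimal sketch of the self-attention equations above, assuming PyTorch (as in the linked tutorial) and the row-vector convention $q_i = x_i W_q$; the weight shapes and toy sizes are illustrative, not from the source:

```python
import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (n, d) sequence of row vectors x_i.
    W_q, W_k: (d, d_k) and W_v: (d, d_v) projection matrices.
    Returns Y of shape (n, d_v) with y_i = sum_j a_ij * v_j.
    """
    Q = X @ W_q                         # (n, d_k) queries
    K = X @ W_k                         # (n, d_k) keys
    V = X @ W_v                         # (n, d_v) values
    d_k = Q.size(-1)
    scores = Q @ K.T / d_k ** 0.5       # (n, n); entry (i, j) = q_i^T k_j / sqrt(d_k)
    A = F.softmax(scores, dim=-1)       # a_ij; each row sums to 1
    return A @ V                        # (n, d_v)

# Toy usage with random weights
n, d, d_k, d_v = 5, 16, 8, 8
X = torch.randn(n, d)
Y = self_attention(X, torch.randn(d, d_k), torch.randn(d, d_k), torch.randn(d, d_v))
print(Y.shape)  # torch.Size([5, 8])
```

And a sketch of the sinusoidal positional encoding, assuming an even $d$ and the pair indexing written in the list above ($k = 1, 2, \dots$; the original paper indexes pairs from 0):

```python
import torch

def positional_encoding(n, d):
    """Sinusoidal positional encodings of shape (n, d); d assumed even.

    Dimension pair k holds sin(i / 10000^{2k/d}) and cos(i / 10000^{2k/d}).
    """
    i = torch.arange(n, dtype=torch.float32).unsqueeze(1)    # positions, (n, 1)
    k = torch.arange(1, d // 2 + 1, dtype=torch.float32)     # pair index, (d/2,)
    denom = 10000.0 ** (2.0 * k / d)                         # 10000^{2k/d}
    P = torch.zeros(n, d)
    P[:, 0::2] = torch.sin(i / denom)                        # even slots: sin
    P[:, 1::2] = torch.cos(i / denom)                        # odd slots: cos
    return P

# The encoding is added to the token embeddings before self-attention:
# X = embeddings + positional_encoding(n, d)
```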
...