Transformer Fundamentals
Transformers use attention instead of recurrence, so all positions in a sequence can be processed in parallel.
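A minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, may make the idea concrete. It uses plain Python lists for readability (real code would use a tensor library), and the toy vectors and the function names `softmax` and `attention` are illustrative assumptions, not part of these notes.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy sequence of three 2-dimensional token vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Self-attention: queries, keys, and values all come from the same sequence.
context = attention(tokens, tokens, tokens)
```

Every output row is computed from the whole sequence at once, with no dependence on a previous step's hidden state.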
Architecture
Encoder layer: multi-head self-attention + feed-forward. Decoder layer: masked self-attention + encoder-decoder attention + feed-forward.
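The "masked" part of decoder self-attention can be sketched as a lower-triangular mask: position i may only attend to positions j <= i. The helper names `causal_mask` and `masked_softmax` are hypothetical, and the uniform scores are chosen only to make the effect of the mask easy to see.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed entries; disallowed entries get weight 0."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Uniform raw scores, so each row spreads its weight evenly over visible positions.
weights = [masked_softmax([0.0, 0.0, 0.0], row) for row in mask]
```

Here the first position can only see itself, the second splits its weight over two positions, and the third over all three. Tensor implementations usually get the same result by adding a large negative value to the disallowed scores before the softmax.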
Positional encoding: adds position information to the token embeddings, because attention alone ignores token order.
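One common choice, assumed here because the notes do not name a scheme, is the fixed sinusoidal encoding from the original Transformer paper: sine on even dimensions, cosine on odd dimensions, with frequencies set by 10000^(2i/d_model). Learned position embeddings are an alternative. The function names below are illustrative.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            # Each pair of dimensions (2i, 2i+1) shares one frequency.
            angle = pos / (10000 ** ((dim - dim % 2) / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Element-wise sum of token embeddings and their positional encodings."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, enc)]
            for emb, enc in zip(embeddings, pe)]

# Two identical toy embeddings become distinguishable once positions are added.
encoded = add_positions([[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]])
```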
Key Components
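Multi-head attention: several attention heads run in parallel and their outputs are concatenated. Residual connections: around each sub-layer. Layer normalization: applied after each residual addition in the original design.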
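The residual connection and layer normalization can be sketched together as LayerNorm(x + Sublayer(x)), the post-norm arrangement of the original Transformer. This is a simplified sketch: the learned gain and bias of layer normalization are omitted, and `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sub-layer.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def residual_block(vector, sublayer):
    """Post-norm wrapper: LayerNorm(x + Sublayer(x))."""
    transformed = sublayer(vector)
    # Residual connection: add the sub-layer output back onto its input.
    summed = [x + t for x, t in zip(vector, transformed)]
    return layer_norm(summed)

def toy_sublayer(vector):
    """Hypothetical stand-in for an attention or feed-forward sub-layer."""
    return [2.0 * x for x in vector]

normalized = residual_block([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

Many later models move the normalization in front of the sub-layer (pre-norm) instead; the sketch follows the original ordering only.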
Feed-forward: two linear layers with a ReLU in between, applied to each position independently.
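As a rough sketch, the feed-forward sub-layer computes FFN(x) = ReLU(x W1 + b1) W2 + b2 for each position. The weights below are made-up numbers (model width 2, hidden width 3) and the helper names are illustrative assumptions.

```python
def linear(vector, weights, bias):
    """Compute vector @ weights + bias; weights has one row per input dimension."""
    return [sum(x * row[j] for x, row in zip(vector, weights)) + bias[j]
            for j in range(len(bias))]

def relu(vector):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in vector]

def feed_forward(vector, w1, b1, w2, b2):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2, applied to one position."""
    return linear(relu(linear(vector, w1, b1)), w2, b2)

# Made-up weights: model width 2, hidden width 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 1.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]

# The same weights are applied to every position independently.
outputs = [feed_forward(x, w1, b1, w2, b2) for x in [[1.0, 2.0], [-1.0, 0.0]]]
```

In practice the hidden width is usually several times the model width; the original Transformer used 2048 against a model width of 512.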
BERT and GPT
BERT: encoder-only, bidirectional, masked language modeling. GPT: decoder-only, left-to-right, causal language modeling.
Both are pre-trained on a large unlabeled corpus, then fine-tuned on a downstream task.
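The two pre-training objectives differ mainly in what the model sees and what it must predict. The sketch below builds training examples for each from a toy sentence. It is a simplification: the masked position is picked by hand, whereas BERT samples positions at random and does not always substitute the mask token. The function names are hypothetical.

```python
def causal_lm_pairs(tokens):
    """GPT-style: predict each token from the tokens to its left."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, masked_positions, mask_token="[MASK]"):
    """BERT-style: hide chosen positions, predict them from both sides."""
    corrupted = [mask_token if i in masked_positions else tok
                 for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return corrupted, targets

sentence = ["the", "cat", "sat", "down"]

# Causal LM: the context never includes tokens to the right of the target.
pairs = causal_lm_pairs(sentence)

# Masked LM: the context includes tokens on both sides of the hidden position.
corrupted, targets = masked_lm_example(sentence, {2})
```

For the causal case the target "sat" is predicted from "the cat" only; for the masked case it is predicted from "the cat [MASK] down".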
Key Takeaways
- Transformers use attention instead of RNNs
- Positional encoding adds order information
- BERT and GPT are transformer variants