Attention in Neural Networks
Attention lets a model focus on the parts of its input that are most relevant to the output it is currently producing.
Attention Weights
Compute a similarity score between the query and each key. Apply softmax to turn the scores into weights that sum to 1. Take the weighted sum of the values using those weights.
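Scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T/√d_k)V, where d_k is the dimension of the keys. Dividing by √d_k keeps the dot products from growing with the dimension and saturating the softmax.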
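The formula can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes a single unbatched sequence, no masking, and random inputs. The function names `softmax` and `scaled_dot_product_attention` and the example shapes are chosen here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)."""
    d_k = K.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    # Each row of weights sums to 1.
    weights = softmax(scores, axis=-1)
    # Weighted sum of the values.
    return weights @ V, weights

# Illustrative shapes (assumed): 3 queries, 5 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Here `output` has one row per query and `weights` has one row per query and one column per key, so each row of `weights` shows how strongly that query attends to each key.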
Types
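Self-attention: a sequence attends to itself, so the queries, keys, and values all come from the same sequence. Multi-head attention: several attention mechanisms run in parallel, each with its own projections, and their outputs are combined.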
The Transformer is built on multi-head self-attention.
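Multi-head self-attention can be sketched as follows. This is a simplified illustration under several assumptions: a single unbatched sequence, no masking, no bias terms, and random matrices standing in for learned projections. The names `W_q`, `W_k`, `W_v`, `W_o` and `multi_head_self_attention` are hypothetical, chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); all W_*: (d_model, d_model)."""
    seq_len, d_model = X.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Self-attention: queries, keys, and values all come from X.
    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)   # (num_heads, seq_len, seq_len)
    heads = weights @ V                  # (num_heads, seq_len, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

# Illustrative sizes (assumed): 6 tokens, model width 8, 2 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out, attn = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
```

Each head works on its own slice of the projected features, which is what lets different heads attend to different relationships in the same sequence.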
Applications
Machine translation. Text summarization. Image captioning. Question answering.
Key Takeaways
- Attention weights focus on relevant input
- Multi-head attention captures multiple relationships
- Multi-head self-attention is the foundation of the Transformer and of modern NLP