Advanced Optimizers
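Optimization algorithms that go beyond plain SGD and Adam: decoupled weight decay, layer-wise scaling for large batches, and sharpness-aware updates.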
AdamW
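Adam with decoupled weight decay. With plain L2 regularization the penalty gradient passes through Adam's adaptive scaling; AdamW instead shrinks the weights directly, separate from the adaptive learning rate.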
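The decoupling can be sketched in a few lines. This is a minimal illustration on flat lists of floats, not a library implementation; the function name adamw_step and the default hyperparameters are assumptions chosen for the example.

```python
import math

def adamw_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update (hypothetical helper). t is the 1-based step count.

    Returns new (params, m, v) lists; inputs are not modified.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Moment estimates use the raw gradient only (no L2 term added to g).
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction for the zero-initialized moments.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Decoupled weight decay: shrink the weight directly,
        # bypassing the adaptive 1/sqrt(v_hat) scaling.
        p = p * (1 - lr * weight_decay)
        # Adaptive gradient step.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

With a zero gradient the weight still shrinks by the factor (1 - lr * weight_decay), which shows that the decay does not depend on the gradient statistics.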
LAMB
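Layer-wise Adaptive Moments optimizer for Batch training. Designed for large-batch training: each layer's Adam-style update is rescaled by a trust ratio (weight norm over update norm), so each layer gets its own effective learning rate.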
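A rough sketch of the per-layer update follows, treating one layer as a flat list of floats. The name lamb_layer_step, the defaults, and the fallback trust ratio of 1.0 when a norm is zero are assumptions for illustration; real implementations differ in details such as clipping and which parameters are excluded from decay.

```python
import math

def lamb_layer_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-6, weight_decay=0.01):
    """One LAMB-style update for a single layer (hypothetical helper).

    t is the 1-based step count. Returns new (w, m, v) lists.
    """
    # Adam moment estimates with bias correction.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Adam direction plus decoupled weight decay.
    r = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
         for mh, vh, wi in zip(m_hat, v_hat, w)]
    # Layer-wise trust ratio: weight norm over update norm.
    w_norm = math.sqrt(sum(x * x for x in w))
    r_norm = math.sqrt(sum(x * x for x in r))
    trust = w_norm / r_norm if w_norm > 0 and r_norm > 0 else 1.0
    new_w = [wi - lr * trust * ri for wi, ri in zip(w, r)]
    return new_w, m, v
```

Because of the trust ratio, the size of the step is lr times the layer's weight norm, largely independent of the gradient's scale. Calling the function once per layer gives each layer its own effective learning rate.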
Sharpness-Aware Minimization
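SAM: seeks flat minima. It perturbs the weights adversarially within a small radius, in the direction that most increases the loss, and takes the descent step using the gradient at that perturbed point. Minimizing this worst-case neighborhood loss tends to improve generalization.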
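The two-gradient structure of a SAM step can be sketched as below, using plain gradient descent as the base optimizer. The name sam_step, the grad_fn callback, and the values of lr and rho are assumptions for illustration; in practice SAM usually wraps an optimizer such as SGD with momentum.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update with a plain gradient-descent base step (hypothetical helper).

    grad_fn(w) must return the loss gradient at w as a list of floats.
    """
    # First gradient: evaluated at the current weights.
    g = grad_fn(w)
    g_norm = math.sqrt(sum(x * x for x in g))
    if g_norm == 0.0:
        # No ascent direction is defined at a stationary point; leave w unchanged.
        return list(w)
    # Adversarial perturbation: move distance rho along the normalized gradient.
    w_adv = [wi + rho * gi / g_norm for wi, gi in zip(w, g)]
    # Second gradient: evaluated at the perturbed weights.
    g_adv = grad_fn(w_adv)
    # Descent step applied to the ORIGINAL weights using the perturbed gradient.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Usage on the toy loss f(w) = sum(w_i^2), whose gradient is 2 * w.
w_next = sam_step([1.0, 0.0], lambda w: [2.0 * x for x in w])
```

Each step costs two gradient evaluations, roughly doubling the compute per update compared with the base optimizer.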
Key Takeaways
- AdamW: Adam + decoupled weight decay
- LAMB: layer-wise trust ratio for large batch training
- SAM: seeks flat minima to improve generalization