Large-Scale Training
Train a single model across multiple devices (usually GPUs) when one device is too slow or has too little memory.
Data Parallelism
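Split the data across GPUs. Each GPU holds a full copy of the model, processes its own shard of the batch, and computes a local gradient. An all-reduce then averages the gradients so every replica applies the same update.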
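A minimal sketch of one data-parallel step, to make the averaging concrete. It simulates the workers as list slices on one machine; the 1-D model y = w * x and the helper names (local_gradient, all_reduce_mean, data_parallel_step) are assumptions made for illustration, not part of any library.

```python
# Toy data-parallel step for the 1-D model y = w * x with squared-error loss.
# Workers are simulated as plain list slices; no real GPUs or network involved.

def local_gradient(w, shard):
    """Mean gradient of (w*x - y)^2 with respect to w over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for all-reduce: every worker ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def data_parallel_step(w, batch, num_workers, lr):
    # Split the batch into equal shards, one per simulated worker.
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # Each worker computes a gradient on its own shard only.
    local_grads = [local_gradient(w, shard) for shard in shards]
    # After the all-reduce, every worker holds the identical averaged gradient.
    synced_grads = all_reduce_mean(local_grads)
    # Every replica applies the same update, so the weights stay in sync.
    return [w - lr * g for g in synced_grads]

batch = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples of y = 3x
replica_weights = data_parallel_step(w=0.0, batch=batch, num_workers=4, lr=0.01)
full_batch_grad = local_gradient(0.0, batch)
```

With equal-sized shards, the average of the per-worker gradients matches the gradient of the full batch, so the result is the same update a single device would have made on all 8 samples.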
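Synchronous: every worker waits for all others before each update. Asynchronous: workers update without waiting, so a gradient may be stale (computed against older parameters); this is tolerated in exchange for higher throughput.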
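A small sketch of what a stale gradient means in practice. It minimizes a toy quadratic and applies each gradient a fixed number of steps late; the objective, the train function, and the staleness parameter are all assumptions chosen for illustration, and the result only shows behavior for this toy problem with a small learning rate.

```python
def gradient(w, target=5.0):
    """Gradient of 0.5 * (w - target)^2."""
    return w - target

def train(steps, lr, staleness):
    """Run SGD where each applied gradient was computed `staleness` steps ago.

    staleness=0 models synchronous updates; staleness>0 models an asynchronous
    worker whose gradient arrives after the parameters have already moved on.
    """
    history = [0.0]  # history[t] is the parameter value after t updates
    for t in range(steps):
        # Look up the (possibly outdated) parameters the gradient was based on.
        stale_index = max(0, t - staleness)
        g = gradient(history[stale_index])
        history.append(history[-1] - lr * g)
    return history[-1]

w_sync = train(steps=200, lr=0.1, staleness=0)
w_async = train(steps=200, lr=0.1, staleness=3)
print(w_sync, w_async)
```

Both runs end close to the minimum at 5.0 here. That is not guaranteed in general: how much staleness a run tolerates depends on the learning rate and the problem.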
Model Parallelism
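Split the model itself across GPUs, typically because it does not fit in one device's memory. Pipeline parallelism: consecutive groups of layers (stages) are placed on different GPUs, and the batch is usually cut into micro-batches so that the stages can work at the same time.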
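A sketch of a simple forward-pass pipeline schedule, to show how stages overlap. It assumes every stage takes one time step per micro-batch and ignores the backward pass; forward_schedule and idle_fraction are hypothetical helper names, not functions from any of the tools listed in these notes.

```python
def forward_schedule(num_stages, num_microbatches):
    """Return, per time step, the list of (stage, microbatch) pairs that run.

    Micro-batch m reaches stage s at time step s + m, so at each step the
    stages work on different micro-batches in parallel.
    """
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        schedule.append(active)
    return schedule

def idle_fraction(num_stages, num_microbatches):
    """Fraction of (stage, time step) slots left idle in the forward pass."""
    total_steps = num_stages + num_microbatches - 1
    busy_slots = num_stages * num_microbatches
    return 1.0 - busy_slots / (num_stages * total_steps)

schedule = forward_schedule(num_stages=4, num_microbatches=8)
for t, active in enumerate(schedule):
    print(t, active)
```

Under these assumptions, 4 stages with a single micro-batch leave three quarters of the slots idle, because only one stage works at a time. Cutting the batch into 8 micro-batches lowers the idle share to 3/11, which is why more micro-batches per batch keep the pipeline fuller.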
Tools
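Horovod, DeepSpeed, PyTorch Distributed (torch.distributed). Mixed precision training: run most computation in FP16 to cut memory use and communication volume, typically while keeping an FP32 copy of the weights for the update.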
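A sketch of one numerical pitfall of FP16: very small gradients round to zero. It also shows loss scaling, a technique commonly paired with FP16 training that these notes do not cover. The example uses only Python's struct module to round values to half precision; the gradient value and the scale factor are assumptions picked for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

tiny_grad = 1e-8          # hypothetical gradient value, below FP16's range
loss_scale = 65536.0      # hypothetical scale factor (2**16)

# Stored directly in FP16, the gradient underflows to zero and is lost.
unscaled_fp16 = to_fp16(tiny_grad)

# Scaling the loss scales every gradient by the same factor, which lifts the
# value into FP16's representable range before it is stored.
scaled_fp16 = to_fp16(tiny_grad * loss_scale)

# The scale is divided out again in higher precision before the update.
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16, recovered)
```

Real frameworks handle the scaling automatically and often adjust the factor during training; this sketch only shows why the step exists.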
Key Takeaways
- Data parallelism: split data
- Model parallelism: split model
- Communication between devices is often the bottleneck