Distributed Training

Large-Scale Training

Distributed training spreads the work of training a model across multiple devices, typically GPUs.

Data Parallelism

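Split the data across GPUs. Each GPU holds a full copy of the model, processes its own batch, and computes gradients locally. An all-reduce then averages the gradients across GPUs, so every copy applies the same update.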
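
The steps above can be sketched in plain Python. This is a simulation under stated assumptions, not a real multi-GPU program: the "workers" are ordinary list shards, `all_reduce_mean` is a hypothetical stand-in for a real all-reduce, and the one-weight model y = w * x is chosen only to keep the arithmetic visible.

```python
# Simulated data parallelism in plain Python (no real GPUs, no communication library).
# Toy model: y_hat = w * x, loss: mean squared error.

def local_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w on one worker's shard."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def all_reduce_mean(values):
    """Hypothetical stand-in for all-reduce: every worker receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

def shard(data, num_workers):
    """Split a list into num_workers equal contiguous shards."""
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]      # the true weight is 3.0
num_workers = 4                 # assumed number of GPUs
x_shards = shard(xs, num_workers)
y_shards = shard(ys, num_workers)

w = 0.0     # every replica starts from the same weight
lr = 0.01
for step in range(200):
    # Each worker computes a gradient on its own shard only.
    grads = [local_gradient(w, xs_k, ys_k) for xs_k, ys_k in zip(x_shards, y_shards)]
    # After the all-reduce, all workers hold the identical averaged gradient.
    synced = all_reduce_mean(grads)
    # Every replica applies the same update, so the copies stay in sync.
    w -= lr * synced[0]
```

Because the shards are the same size, the averaged gradient equals the gradient computed on the full batch, which is why the replicas behave like one model trained on all the data.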

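Synchronous training waits for every worker to finish its step before the model is updated. Asynchronous training lets each worker apply its update without waiting, accepting that some gradients were computed from slightly stale parameters.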
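
To make the difference concrete, here is a small deterministic sketch on a toy loss (w - 3)^2. The function names, the fixed `staleness` delay, and the learning rate are illustrative assumptions; real asynchronous systems have variable delays.

```python
def grad(w):
    """Gradient of the toy loss (w - 3)^2."""
    return 2.0 * (w - 3.0)

def train_sync(num_workers, steps, lr):
    """Synchronous: all workers read the same w and the update waits for all of them."""
    w = 0.0
    for _ in range(steps):
        grads = [grad(w) for _ in range(num_workers)]
        w -= lr * sum(grads) / num_workers
    return w

def train_async(staleness, steps, lr):
    """Asynchronous: each applied gradient was computed `staleness` updates ago."""
    history = [0.0]     # parameter values over time; history[-1] is the current one
    for _ in range(steps):
        # The worker read the parameters before the last `staleness` updates happened.
        stale_w = history[max(0, len(history) - 1 - staleness)]
        history.append(history[-1] - lr * grad(stale_w))
    return history[-1]

w_sync = train_sync(num_workers=4, steps=100, lr=0.1)
w_async = train_async(staleness=2, steps=100, lr=0.1)
```

With this small learning rate both runs settle near w = 3. Stale gradients are tolerable here, but the tolerance depends on the step size: the larger the delay, the smaller the learning rate has to be for the updates to remain stable.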

Model Parallelism

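Split the model itself across GPUs, so each device holds only part of the parameters. Pipeline parallelism is one form of this: consecutive groups of layers (stages) are placed on different GPUs, and activations pass from one stage to the next.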
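
A minimal sketch of a pipeline's forward pass, assuming three hypothetical stages and a batch split into micro-batches so that several stages can work at once. The stage functions and the schedule format are made up for illustration, and everything runs sequentially on one machine.

```python
def stage_1(x):     # hypothetical first group of layers, placed on "GPU 0"
    return [2.0 * v for v in x]

def stage_2(x):     # hypothetical second group of layers, placed on "GPU 1"
    return [v + 1.0 for v in x]

def stage_3(x):     # hypothetical third group of layers, placed on "GPU 2"
    return [v * v for v in x]

def pipeline_forward(stages, micro_batches):
    """Run micro-batches through the stages; return outputs and the per-tick schedule."""
    num_stages, num_mb = len(stages), len(micro_batches)
    in_flight = list(micro_batches)     # current activation of each micro-batch
    schedule = []
    # At tick t, stage s works on micro-batch t - s, if that micro-batch exists.
    for t in range(num_stages + num_mb - 1):
        active = []
        for s in range(num_stages):
            mb = t - s
            if 0 <= mb < num_mb:
                in_flight[mb] = stages[s](in_flight[mb])
                active.append((s, mb))
        schedule.append(active)
    return in_flight, schedule

stages = [stage_1, stage_2, stage_3]
micro_batches = [[1.0, 2.0], [3.0, 4.0]]    # one batch of 4 split into 2 micro-batches
outputs, schedule = pipeline_forward(stages, micro_batches)
# schedule[1] is [(0, 1), (1, 0)]: stage 0 starts micro-batch 1 while stage 1 handles micro-batch 0.
```

The schedule takes num_stages + num_micro_batches - 1 ticks. In the first and last ticks only one stage is busy, which is the idle time that more micro-batches help to amortize.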

Tools

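Common tools include Horovod, DeepSpeed, and PyTorch Distributed. Mixed precision training runs much of the computation in FP16, which halves the memory per value compared with FP32.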
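
The trade-off behind FP16 can be seen with the standard library alone. The sketch below uses `struct`'s half-precision format to round values to FP16. The loss-scaling step at the end is a commonly used companion technique that the text above does not describe, and the scale factor 1024 is an arbitrary assumption.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]

bytes_fp16 = struct.calcsize("e")   # bytes per FP16 value
bytes_fp32 = struct.calcsize("f")   # bytes per FP32 value

# FP16 keeps roughly three significant decimal digits.
rounded = to_fp16(0.1)

# A very small gradient falls below FP16's smallest positive value and becomes zero.
tiny_grad = 1e-8
underflowed = to_fp16(tiny_grad)

# Loss scaling (assumed technique): multiply before the FP16 cast, divide afterwards
# in higher precision, so that small gradients survive the cast.
loss_scale = 1024.0                         # hypothetical scale factor
scaled = to_fp16(tiny_grad * loss_scale)
recovered = scaled / loss_scale
```

Here `underflowed` is exactly zero, while `recovered` stays within about one percent of the original gradient. Frameworks that support mixed precision usually handle this scaling automatically.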

Key Takeaways

  1. Data parallelism: split the data, replicate the model
  2. Model parallelism: split the model across devices
  3. Communication between devices is often the bottleneck
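
A rough back-of-the-envelope estimate shows why communication matters. The sketch assumes a ring all-reduce, in which each GPU sends 2 * (N - 1) / N times the gradient size per step. The model size, GPU count, and link bandwidth are hypothetical numbers, and latency and overlap with computation are ignored.

```python
def ring_allreduce_bytes_per_gpu(num_params, bytes_per_param, num_gpus):
    """Bytes each GPU sends in one ring all-reduce: 2 * (N - 1) / N * payload."""
    payload = num_params * bytes_per_param
    return 2 * (num_gpus - 1) * payload / num_gpus

def comm_seconds(num_params, bytes_per_param, num_gpus, bandwidth_bytes_per_s):
    """Bandwidth-only time estimate for one gradient all-reduce."""
    sent = ring_allreduce_bytes_per_gpu(num_params, bytes_per_param, num_gpus)
    return sent / bandwidth_bytes_per_s

num_params = 100_000_000    # hypothetical model with 100M parameters
num_gpus = 8                # hypothetical cluster size
bandwidth = 10e9            # hypothetical 10 GB/s link per GPU

t_fp32 = comm_seconds(num_params, 4, num_gpus, bandwidth)   # gradients sent as FP32
t_fp16 = comm_seconds(num_params, 2, num_gpus, bandwidth)   # gradients sent as FP16
```

Under these assumptions each step spends about 0.07 s exchanging FP32 gradients and half that for FP16. If the forward and backward pass takes a similar time, communication accounts for a large share of every step.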
