Model Quantization

Reducing Model Size

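Quantization stores a model's weights, and often its activations, at lower numerical precision than the 32-bit floats typically used in training. This shrinks the model and can speed up inference, usually at the cost of a small loss in accuracy.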
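As a rough illustration of mapping weights to lower precision, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. The function names (quantize_int8, dequantize) and the sample weights are assumptions made up for this example; real frameworks use more elaborate schemes such as per-channel scales and zero points.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.02, 1.0]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)         # integer codes, one byte each instead of four
print(restored)  # close to the originals, within half a step
```

Each restored value differs from its original by at most half a quantization step (scale / 2), which is where the accuracy loss comes from.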

Types

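Post-training quantization (PTQ): applied to an already trained model, with no retraining. Quantization-aware training (QAT): quantization is simulated during training, so the model learns weights that tolerate the lower precision.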
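To make the "during training" idea concrete, here is a toy sketch of quantization-aware training on a one-weight model. It uses fake quantization (round to the grid, then convert back to float) in the forward pass and treats the rounding step as having gradient 1, a common trick known as the straight-through estimator. The step size, learning rate, data, and function name are all assumptions chosen for illustration, not a recipe from any particular library.

```python
def fake_quantize(w, scale, qmin=-127, qmax=127):
    """Round w to the nearest quantization level, returned as a float."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


# Toy problem: fit y = w * x, where the ideal weight is 0.7.
data = [(1.0, 0.7), (2.0, 1.4), (3.0, 2.1)]
w = 0.0        # full-precision "shadow" weight updated by training
scale = 0.1    # assumed fixed quantization step
lr = 0.05

for _ in range(200):
    grad = 0.0
    for x, y in data:
        w_q = fake_quantize(w, scale)   # forward pass sees the quantized weight
        pred = w_q * x
        # Straight-through estimator: treat d(w_q)/dw as 1.
        grad += 2 * (pred - y) * x
    w -= lr * grad / len(data)

print(w, fake_quantize(w, scale))
```

Training stops moving once the quantized weight fits the data, even though the underlying float weight is not exactly 0.7. Post-training quantization would instead train w normally and apply the rounding once at the end.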

Precision

FP32: 32-bit float (4 bytes per value). FP16: 16-bit float (2 bytes). INT8: 8-bit integer (1 byte). Binary: 1 bit per value.
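The effect of each precision on storage is simple arithmetic. The sketch below estimates weight storage for a hypothetical model with one million parameters; the parameter count is an assumption, and the estimate ignores metadata such as quantization scales.

```python
BITS_PER_VALUE = {"FP32": 32, "FP16": 16, "INT8": 8, "Binary": 1}


def model_size_bytes(num_params, bits):
    """Storage needed for num_params values at the given bit width."""
    return num_params * bits / 8


num_params = 1_000_000  # hypothetical model size
sizes = {name: model_size_bytes(num_params, bits)
         for name, bits in BITS_PER_VALUE.items()}

for name, size in sizes.items():
    print(f"{name}: {size / 1e6:.3f} MB")
```

Going from FP32 to INT8 cuts weight storage by 4x, and binary weights cut it by 32x.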

Methods

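Dynamic quantization: weights are quantized ahead of time, and activations are quantized on the fly at inference, with ranges computed from each input. Static quantization: weights and activations are both quantized ahead of time, with activation ranges fixed using a calibration dataset.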
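The difference between the two methods comes down to when the activation scale is chosen. The following sketch contrasts them using symmetric INT8 quantization; the helper names, calibration batches, and input values are hypothetical.

```python
def scale_from_range(values, qmax=127):
    """Pick a scale so the largest magnitude maps to qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round to integer codes, clipping anything outside the range."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Static: the scale is fixed once, from calibration data seen before deployment.
calibration_batches = [[0.2, -1.0, 0.5], [2.0, -0.3, 1.1]]
static_scale = scale_from_range(
    [v for batch in calibration_batches for v in batch]
)


def quantize_static(activations):
    return quantize(activations, static_scale)


# Dynamic: the scale is recomputed from each input at inference time.
def quantize_dynamic(activations):
    scale = scale_from_range(activations)
    return quantize(activations, scale), scale


x = [0.4, -0.1, 4.0]  # contains a value larger than anything in calibration
q_static = quantize_static(x)
q_dynamic, dynamic_scale = quantize_dynamic(x)

print(q_static)   # 4.0 is clipped because calibration only saw up to 2.0
print(q_dynamic)  # range adapts to this input, at the cost of coarser steps
```

Static quantization avoids the runtime cost of computing ranges but depends on the calibration data being representative. Dynamic quantization adapts to each input but does extra work per inference.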

Key Takeaways

  1. Reducing numerical precision compresses the model
  2. INT8 is often sufficient
  3. Quantize after training (post-training quantization) or during it (quantization-aware training)
