Reducing Model Size
Quantize weights to lower precision.
Types
Post-training quantization: after training. Quantization-aware training: during training.
Precision
FP32: 32-bit float. FP16: 16-bit float. INT8: 8-bit integer. Binary: 1-bit.
Methods
Dynamic quantization: only activations. Static: calibrates with data.
Key Takeaways
- Reduce precision to compress
- INT8 often sufficient
- Post-training or aware training