Learning from Multiple Modalities
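Multimodal learning trains a model on more than one input type, such as text, images, and audio, so that each modality can supply information the others lack.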
Fusion Methods
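Early fusion: combine raw inputs or low-level features before a single model processes them.
Late fusion: run a separate model per modality and combine their outputs (e.g. by averaging scores).
Cross-modal attention: let representations of one modality attend to those of another (e.g. text tokens attending to image patches).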
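
The three fusion methods can be sketched on toy vectors. This is a minimal illustration, not a reference implementation: the function names, the feature values, and the equal weighting in late fusion are all assumptions made for the example, and small feature vectors stand in for raw data.

```python
import math

def early_fusion(text_feat, image_feat):
    """Early fusion: join per-modality features into one input vector."""
    return text_feat + image_feat  # list concatenation

def late_fusion(text_probs, image_probs, text_weight=0.5):
    """Late fusion: weighted average of per-modality class probabilities.
    The 0.5 default weight is an arbitrary choice for this sketch."""
    w = text_weight
    return [w * t + (1.0 - w) * i for t, i in zip(text_probs, image_probs)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: each query (e.g. a text token)
    attends over keys/values from another modality (e.g. image patches)."""
    d = len(keys[0])
    dim_out = len(values[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_out)]
        )
    return attended

# Toy inputs (made-up numbers).
text_feat = [0.2, 0.7]
image_feat = [0.9, 0.1, 0.4]
fused = early_fusion(text_feat, image_feat)       # one 5-dimensional vector

text_probs = [0.8, 0.2]                           # text-only model output
image_probs = [0.4, 0.6]                          # image-only model output
avg = late_fusion(text_probs, image_probs)        # combined class scores

text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_modal_attention(text_tokens, image_patches, image_patches)
```

Early fusion lets the model see interactions between modalities from the first layer, late fusion keeps the per-modality models independent, and cross-modal attention sits between the two by mixing learned representations.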
Models
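CLIP: contrastive language-image pretraining; learns a shared embedding space from image-text pairs.
GPT-4V: GPT-4 with vision; accepts images alongside text prompts.
Flamingo: visual language model built for few-shot learning from interleaved image-text examples.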
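
To make "contrastive" concrete, here is a rough sketch of a CLIP-style symmetric contrastive loss over a batch of paired embeddings. It assumes the embeddings already exist (no encoders are included), the function names are hypothetical, and the temperature of 0.07 is an illustrative value rather than a recommendation.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities.
    Pair i (image_embs[i], text_embs[i]) is the positive; every other
    combination in the batch is treated as a negative."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text: row i should put its mass on column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: column j should put its mass on row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy check: matching pairs give a lower loss than mismatched pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls matching image and text embeddings together and pushes mismatched ones apart, which is what produces the shared embedding space.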
Applications
Image captioning, visual question answering (VQA), video understanding, and text-to-image generation.
Key Takeaways
- Multimodal learning combines different data types (text, images, audio)
- CLIP enables zero-shot image classification
- Multimodal models are now prominent across captioning, visual QA, video understanding, and text-to-image generation
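
The zero-shot takeaway can be illustrated with a small sketch: classify an image by comparing its embedding to the embeddings of candidate label prompts and picking the closest. The embeddings below are made-up 3-dimensional vectors standing in for encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any library.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, label_embs):
    """Return the label whose text embedding is most similar to the image
    embedding, plus all similarity scores. No label-specific training is
    involved: new classes are added by embedding new prompts."""
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical text embeddings for three label prompts.
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Hypothetical image embedding in the same space.
image_emb = [0.8, 0.3, 0.1]

best, scores = zero_shot_classify(image_emb, label_embs)
print(best)
```

In a real system both sets of vectors would come from a pretrained image encoder and text encoder that share an embedding space.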