BERT Architecture
Bidirectional Encoder Representations from Transformers: a stack of Transformer encoder layers.
Pre-training
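Masked Language Modeling (MLM): hide a fraction of the input tokens (15% in the original BERT) and predict them from both left and right context. Next Sentence Prediction (NSP): classify whether sentence B actually follows sentence A. Because every token attends to context on both sides in every layer, the representations are deeply bidirectional.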
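A minimal sketch of MLM input corruption, assuming the 80/10/10 replacement rule from the original BERT recipe (80% `[MASK]`, 10% random token, 10% unchanged). The function name `mask_tokens`, the toy vocabulary, and the per-token coin flip are illustrative assumptions; real implementations work on token IDs and typically pick a fixed share of positions per sequence.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # special tokens are never selected for prediction


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels[i] is None where no prediction is needed."""
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Skip special tokens and positions not selected for prediction.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels


tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # hypothetical toy vocabulary
masked, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(masked)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, so the model cannot simply copy its input.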
Fine-Tuning
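Add a small task-specific head on top of the pre-trained encoder. Train all parameters end-to-end on labeled data. Works for tasks such as classification and question answering (QA).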
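A rough sketch of a classification head: a single linear layer plus softmax applied to the encoder's `[CLS]` vector. The vector, weights, and dimensions below are made-up values for illustration, not outputs of a real model, and only the forward pass and loss are shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(cls_vector, weights, bias):
    """Linear layer: one weight row and one bias per class."""
    return [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]


def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[label])


# Hypothetical 4-dimensional [CLS] representation (real BERT-base uses 768).
cls_vector = [0.5, -1.0, 0.25, 2.0]
# Hypothetical head parameters for a 2-class task.
weights = [[0.1, 0.0, -0.2, 0.3], [-0.1, 0.2, 0.1, -0.3]]
bias = [0.0, 0.0]

logits = classification_head(cls_vector, weights, bias)
probs = softmax(logits)
loss = cross_entropy(probs, label=0)
print(probs, loss)
```

In end-to-end fine-tuning, the gradient of this loss would update both the head parameters and the encoder weights that produced `cls_vector`.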
Variants
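RoBERTa: more data and longer training, with dynamic masking and no NSP. ALBERT: cross-layer parameter sharing and factorized embeddings to cut parameter count. DistilBERT: a smaller student model trained by knowledge distillation from BERT.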
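A minimal sketch of the soft-target term used in knowledge distillation: the student is trained to match the teacher's temperature-softened output distribution. This is the generic distillation loss, not DistilBERT's full training objective, and the logits and temperature are illustrative assumptions.

```python
import math


def softmax_t(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    p_teacher = softmax_t(teacher_logits, temperature)
    p_student = softmax_t(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


teacher_logits = [2.0, 0.5, -1.0]  # hypothetical teacher output
close_student = [1.8, 0.6, -0.9]   # hypothetical student that roughly agrees
far_student = [-1.0, 0.5, 2.0]     # hypothetical student that disagrees

print(distillation_loss(close_student, teacher_logits))
print(distillation_loss(far_student, teacher_logits))
```

The loss is lowest when the student's softened distribution equals the teacher's, so a student that agrees with the teacher scores lower than one that does not.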
Key Takeaways
- BERT is a bidirectional Transformer encoder
- Pre-trained with MLM and NSP, then fine-tuned per task
- Many variants improve on BERT