Transformer Architecture

Topic: Transformers

Transformer Fundamentals

Transformers use attention instead of recurrence, so every position in a sequence can be processed in parallel.
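As a minimal sketch of the attention operation itself, the NumPy code below implements scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. The function names, the optional `mask` argument, and the toy input sizes are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: arrays of shape (seq_len, d_k). Hypothetical helper for illustration.
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Positions where mask is False get a large negative score.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Self-attention: queries, keys, and values all come from the same sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, 8 features each (toy sizes)
out, weights = scaled_dot_product_attention(x, x, x)
```

Every row of `weights` is computed from the same matrix product, which is why no step-by-step recurrence over positions is needed.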

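Encoder layer: multi-head self-attention followed by a position-wise feed-forward network. Decoder layer: masked (causal) self-attention, then encoder-decoder (cross) attention over the encoder output, then a feed-forward network.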
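The masking in the decoder's self-attention can be sketched as follows: a lower-triangular mask keeps each position from attending to later positions. The helper names `causal_mask` and `masked_attention_weights` are hypothetical, and the all-zero scores are only a toy input.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may see positions 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores, mask):
    # Disallowed positions get -inf, so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

mask = causal_mask(4)
# With equal raw scores, each row spreads its weight evenly over the visible positions.
weights = masked_attention_weights(np.zeros((4, 4)), mask)
```

The first row puts all of its weight on position 0, while the last row attends equally to all four positions.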

Positional encoding: attention by itself ignores token order, so position information is added to the token embeddings.
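One common choice, assumed here for illustration, is the fixed sinusoidal encoding, where even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. Learned position embeddings are another option. The function name and toy sizes below are hypothetical, and the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumes d_model is even so sine and cosine columns pair up.
    positions = np.arange(seq_len)[:, None]   # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # even indices 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# Add position information to (toy, random) token embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8))
pe = sinusoidal_positional_encoding(6, 8)
inputs = embeddings + pe
```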

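Multi-head attention: several attention heads run in parallel on different learned projections, and their outputs are concatenated. Residual connections: wrap each sub-layer. Layer normalization: applied to each sub-layer's output after the residual addition in the original design; many later models normalize before the sub-layer instead.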
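The three pieces above can be sketched together in NumPy. This is a simplified illustration under stated assumptions: random weights stand in for learned projections, layer normalization omits its learnable gain and bias, normalization is applied after the residual addition, and all names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (no learnable gain/bias here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # one score matrix per head
    heads = softmax(scores, axis=-1) @ v                 # (num_heads, seq_len, d_head)
    # Concatenate the heads back into (seq_len, d_model), then mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

# Residual connection around the sub-layer, followed by layer normalization.
y = layer_norm(x + multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads))
```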

Feed-forward: two linear layers with a ReLU between them, applied to each position independently.
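A minimal sketch of this sub-layer, assuming random weights and toy sizes (the 4x hidden expansion is a common convention, used here only for illustration):

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    # First linear layer + ReLU, then second linear layer back to d_model.
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # hypothetical sizes
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

x = rng.normal(size=(5, d_model))  # 5 positions
out = feed_forward(x, w1, b1, w2, b2)

# The same weights are applied to every position, so processing one
# position alone gives the same result as processing it in the batch.
single = feed_forward(x[:1], w1, b1, w2, b2)
```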

BERT: encoder-only, bidirectional, trained with masked language modeling. GPT: decoder-only, left-to-right, trained with causal language modeling.
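The two training objectives differ mainly in how inputs and targets are built from the same token sequence. The sketch below is a simplification under stated assumptions: the token IDs, `MASK_ID`, and `IGNORE` sentinel are hypothetical, and mask positions are passed in by hand, whereas real masked language modeling picks them at random.

```python
import numpy as np

MASK_ID = 0  # hypothetical ID of a [MASK] token
IGNORE = -1  # hypothetical label meaning "no loss at this position"

def causal_lm_pairs(token_ids):
    # Causal LM: predict token t+1 from tokens up to t.
    return token_ids[:-1], token_ids[1:]

def masked_lm_pairs(token_ids, mask_positions):
    # Masked LM: hide selected tokens and predict them from both sides.
    inputs = token_ids.copy()
    labels = np.full_like(token_ids, IGNORE)
    labels[mask_positions] = token_ids[mask_positions]
    inputs[mask_positions] = MASK_ID
    return inputs, labels

tokens = np.array([11, 12, 13, 14, 15])  # toy token IDs
clm_inputs, clm_targets = causal_lm_pairs(tokens)
mlm_inputs, mlm_labels = masked_lm_pairs(tokens, np.array([1, 3]))
```

In the causal case every position has a target (the next token); in the masked case only the hidden positions contribute to the loss.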

Both are pre-trained on a large unlabeled corpus, then fine-tuned on a downstream task.
