Transformer Architecture

Topic: Transformers

Transformer Fundamentals

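Transformers replace recurrence with attention: each token attends directly to every other token in the sequence, so the whole sequence can be processed in parallel instead of one step at a time.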
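
As a rough illustration of what "attention" computes, here is a minimal NumPy sketch of scaled dot-product attention, the form used in the original Transformer design. The function names, shapes, and random input are assumptions made for this example, not part of any library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k: (seq_len, d_k); v: (seq_len, d_v)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights          # weighted average of the values

# Illustrative input: 4 tokens with 8 features each (random, not real embeddings).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
output, weights = scaled_dot_product_attention(x, x, x)  # self-attention
```

Every row of weights is computed from the same matrix product, which is why no step-by-step recurrence is needed.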

Architecture

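Encoder: each layer applies multi-head self-attention followed by a feed-forward network. Decoder: each layer applies masked self-attention, then encoder-decoder attention over the encoder output, then a feed-forward network.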
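
The "masked" part of decoder self-attention can be sketched as a lower-triangular mask applied to the attention scores before the softmax, so a position cannot look at later positions. This is a minimal NumPy sketch; the helper names are hypothetical and the all-zero scores are placeholder values.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may see positions 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores, mask):
    # Set disallowed scores to -inf so the softmax gives them zero weight.
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # placeholder scores: all positions equally similar
weights = masked_attention_weights(scores, causal_mask(4))
```

With equal scores, the first position puts all its weight on itself and the second splits its weight evenly over the first two positions; everything above the diagonal is zero.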

Positional encoding: attention by itself ignores token order, so position information is added to the token embeddings.
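
One common way to build that position information is the fixed sinusoidal scheme from the original Transformer design; learned position embeddings are another option. The sketch below assumes the sinusoidal variant and an even d_model, and the zero embeddings are placeholders.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumes d_model is even.
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # even feature indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(5, 16)
embeddings = np.zeros((5, 16))   # placeholder token embeddings
inputs = embeddings + pe         # position information is simply added
```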

Key Components

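Multi-head attention: several attention heads run in parallel, each with its own projections, and their outputs are concatenated. Residual connections: the input of each sub-layer is added to its output. Layer normalization: applied around each sub-layer to stabilize training.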
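
These three components can be put together in a small NumPy sketch: the input is projected, split into heads, attended per head, concatenated, and then combined with the input through a residual connection and layer normalization. The weight matrices are random placeholders, the learned scale and shift of layer normalization are omitted, and normalizing after the residual addition is one possible ordering (the one in the original design), so treat this as an assumption-laden illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (learned scale/shift omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores) @ v                          # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o                                  # output projection

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 8, 2, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

attn_out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
y = layer_norm(x + attn_out)  # residual connection, then layer normalization
```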

Feed-forward: two linear layers with a ReLU between them, applied to each position independently.
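
A minimal NumPy sketch of this sub-layer follows. The sizes d_model and d_ff and the random weights are illustrative assumptions.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    hidden = np.maximum(0.0, x @ w1 + b1)  # first linear layer + ReLU
    return hidden @ w2 + b2                # second linear layer

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                      # illustrative sizes
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

x = rng.normal(size=(4, d_model))          # 4 positions
out = feed_forward(x, w1, b1, w2, b2)
```

Because the same weights are applied to every row of x, running the function on a single position gives the same result as the corresponding row of out.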

BERT and GPT

BERT: encoder-only and bidirectional, trained with masked language modeling. GPT: decoder-only and left-to-right, trained with causal language modeling.
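
The difference between the two training objectives shows up in how inputs and targets are built from the same sentence. The sketch below is deliberately simplified: the helper names are hypothetical, tokens are plain words, and the masked position is chosen by hand rather than sampled at random as it would be in real pre-training.

```python
def causal_lm_pairs(tokens):
    # Each position predicts the next token using only the tokens to its left.
    return tokens[:-1], tokens[1:]

def masked_lm_pairs(tokens, masked_positions, mask_token="[MASK]"):
    # Hide the chosen positions; the model predicts them from both sides.
    inputs = [mask_token if i in masked_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets

tokens = ["the", "cat", "sat", "down"]
gpt_inputs, gpt_targets = causal_lm_pairs(tokens)
bert_inputs, bert_targets = masked_lm_pairs(tokens, {1})
```

Here gpt_targets is the input shifted by one position, while bert_inputs keeps the context on both sides of the hidden word.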

Both are pre-trained on a large text corpus and then fine-tuned on a specific task.

Key Takeaways

  1. Transformers use attention instead of RNNs
  2. Positional encoding adds order information to the embeddings
  3. BERT and GPT are transformer variants
