Evaluating Language Models
How to assess the capabilities of large language models (LLMs): common benchmarks, metrics, and open challenges.
Benchmarks
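- MMLU (Massive Multitask Language Understanding): multiple-choice questions spanning 57 subjects, scored by accuracy.
- HumanEval: 164 Python programming problems; generated code is checked against unit tests.
- BIG-bench: a community-contributed collection of more than 200 diverse tasks.
- HELM (Holistic Evaluation of Language Models): a framework that reports several metrics across many scenarios rather than a single score.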
Metrics
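- Accuracy: fraction of correct answers, for tasks with a single right answer (e.g. multiple choice).
- BLEU and ROUGE: n-gram overlap between generated text and reference text; BLEU is common for translation, ROUGE for summarization.
- Perplexity: the exponentiated average negative log-likelihood per token; lower is better.
- Latency: time taken to produce a response, which matters for deployment rather than output quality.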
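As a minimal sketch of two of the metrics above, the snippet below computes exact-match accuracy and perplexity. The function names and the input format (a list of per-token natural-log probabilities) are assumptions made for illustration, not part of any particular evaluation library.

```python
import math


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def perplexity(token_logprobs):
    """exp of the negative mean log-probability per token (natural log assumed)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical multiple-choice answers: 3 of 4 match the references.
acc = accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])

# Hypothetical log-probs: a model that assigns probability 0.25 to every token.
ppl = perplexity([math.log(0.25)] * 4)
```

A model that spreads probability uniformly over four choices at every step has a perplexity of about 4, which is why perplexity is often read as an effective branching factor.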
Challenges
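- Few-shot evaluation: scores depend on the number, choice, and formatting of in-context examples, so results are comparable only under the same prompt setup.
- Domain-specific evaluation: general benchmarks may not reflect performance in a specialized domain.
- Bias detection: skewed or harmful outputs are hard to capture with a single automatic metric.
- Human evaluation: still needed for qualities that automatic metrics miss.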
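Few-shot evaluation places a handful of worked examples in the prompt before the test question. The sketch below shows one way such a prompt might be assembled; the `Q:`/`A:` template and the function name `build_few_shot_prompt` are hypothetical choices for illustration, and real benchmarks each define their own format.

```python
def build_few_shot_prompt(examples, query, k=2):
    """Prepend the first k (question, answer) pairs to the test question.

    The "Q:/A:" layout is an assumed template, not a benchmark standard.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    # The final block leaves the answer empty for the model to complete.
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Hypothetical demonstration pool and test question.
pool = [("2+2?", "4"), ("3+3?", "6"), ("1+1?", "2")]
two_shot = build_few_shot_prompt(pool, "5+5?", k=2)
zero_shot = build_few_shot_prompt(pool, "5+5?", k=0)
```

Varying `k` or the template changes what the model sees, so evaluation reports should state both alongside the score.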
Key Takeaways
- Benchmarks such as MMLU, HumanEval, BIG-bench, and HELM each measure different capabilities
- Metrics: accuracy, BLEU/ROUGE, perplexity, latency
- Human evaluation often needed