Overfitting and Underfitting

Understanding Model Fit

Model fit describes how well a model captures the patterns in its training data. The fundamental challenge is that training data is only a sample from a larger population, so a model that fits the training data perfectly might still fail to generalize to new data.

The relationship between model complexity and generalization is central to machine learning. Models that are too simple fail to capture important patterns (underfitting). Models that are too complex fit noise rather than signal (overfitting).

Understanding and managing this tradeoff is essential for building models that generalize well.

Underfitting

Underfitting occurs when models are too simple to capture patterns in the data. Both training and test performance are poor.

Symptoms of Underfitting

Training performance is poor. The model fails to learn basic patterns in the data. Errors are high even on training data.

Test performance is similar to training performance. Both are poor because the model cannot capture the underlying structure.

The model has high bias. It makes strong assumptions that are wrong for the data.
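The symptoms above can be reproduced on a toy problem. The following is a minimal sketch, assuming NumPy is available and using an invented data source (a noisy quadratic); the helper names `make_data` and `mse` are illustrative only. A straight line is fit to the curved data and compared with a quadratic fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n points from a noisy quadratic (an assumed toy data source)."""
    x = rng.uniform(-3, 3, size=n)
    y = x**2 + rng.normal(scale=0.5, size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(200)
x_test, y_test = make_data(200)

# A straight line is too simple for quadratic data, so it underfits.
line = np.polyfit(x_train, y_train, deg=1)
# A quadratic matches the structure of the data.
quad = np.polyfit(x_train, y_train, deg=2)

underfit_train = mse(line, x_train, y_train)
underfit_test = mse(line, x_test, y_test)
good_train = mse(quad, x_train, y_train)
good_test = mse(quad, x_test, y_test)

print(f"line:      train MSE {underfit_train:.2f}, test MSE {underfit_test:.2f}")
print(f"quadratic: train MSE {good_train:.2f}, test MSE {good_test:.2f}")
```

In this setup the line should show a high error on both sets, with the two errors close to each other, which is the underfitting signature described above.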

Causes of Underfitting

Insufficient features provide little information for prediction. The model cannot learn relationships that depend on missing variables.

Excessive regularization prevents the model from fitting patterns. Strong penalties might constrain the model too much.

Insufficient model capacity prevents capturing complex patterns. Linear models might be too simple for non-linear relationships.

Insufficient training prevents the model from learning. If optimization stops after too few iterations, the model has not converged and performs poorly even on the training data.

Addressing Underfitting

Add more informative features. Feature engineering might create predictive variables that the model can use.

Reduce regularization strength. Less penalty allows more complex patterns to be fit.

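Increase model complexity. More flexible models can capture more patterns: for example, more layers or units in a neural network, or deeper trees in a tree-based model. Adding more trees to a random forest mainly reduces variance and does little to reduce bias.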

Train longer. More iterations allow better convergence.

Overfitting

Overfitting occurs when models capture noise rather than signal. Training performance is excellent but test performance is poor.

Symptoms of Overfitting

Training performance is excellent. Errors are very low. The model appears to fit perfectly.

Test performance is much worse than training. The gap between train and test indicates overfitting.

The model has high variance. Small changes in training data lead to very different models.
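A small experiment makes the train/test gap concrete. This is a sketch under assumed conditions: NumPy is available, the "true" function is an invented `sin(3x)`, and the model is a degree-9 polynomial fit to only 10 noisy points, so it has as many coefficients as observations.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    """Assumed ground-truth signal for this toy example."""
    return np.sin(3 * x)

noise = 0.3
x_train = np.linspace(-1, 1, 10)
y_train = true_fn(x_train) + rng.normal(scale=noise, size=x_train.size)
x_test = rng.uniform(-1, 1, size=500)
y_test = true_fn(x_test) + rng.normal(scale=noise, size=x_test.size)

# Degree 9 with 10 points: the polynomial can pass through every noisy
# training point, so it memorizes the noise along with the signal.
coeffs = np.polyfit(x_train, y_train, deg=9)

train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))

print(f"train MSE {train_mse:.2e}, test MSE {test_mse:.2e}")
```

The training error should be essentially zero while the test error stays at or above the noise level, which is the gap described above.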

Causes of Overfitting

Excessive model complexity lets the model fit noise rather than signal, typically because it has more parameters than the data can support.

Insufficient regularization allows too much flexibility. Without a penalty, a sufficiently flexible model can fit almost any pattern in the training data, including noise.

Insufficient training data doesn't constrain model complexity. Having more parameters than data points is a common warning sign.

Training for too long can lead the model to fit noise. Early stopping can help.

Addressing Overfitting

Reduce model complexity. Simpler models are less likely to fit noise. Options include fewer parameters, shallower trees, or fewer layers.

Increase regularization. Stronger penalties constrain flexibility. Common techniques include L1 and L2 penalties and dropout.

Add training data. More data provides more constraints on what the model can learn.

Use validation data to stop training early. Monitor validation performance and stop when it starts degrading.

The Bias-Variance Tradeoff

The bias-variance tradeoff describes the relationship between model complexity and generalization.

Bias

Bias is error from overly simplistic models. High-bias models make strong assumptions that might be wrong. They underfit.

High bias shows as poor training performance. The model cannot capture patterns. Simple linear models on non-linear data have high bias.

Reducing bias typically requires more complex models. This might increase variance.

Variance

Variance is error from a model being too sensitive to its training data. High-variance models change substantially when trained on different data. They overfit.

High variance shows as a large difference between training and test performance. The model fits the training data very well but performs much worse on new data.

Reducing variance typically requires simpler models or more data. This might increase bias.
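The two error sources described above are often summarized by the standard decomposition of expected squared error. The notation here is an assumption, not taken from the text: the data follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the fitted model, and the expectation is over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Changing model complexity moves error between the first two terms; the third term cannot be reduced by any choice of model.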

Optimal Complexity

The optimal model complexity balances bias and variance: the model is neither too simple (high bias) nor too complex (high variance).

The optimal complexity depends on the amount of data and on the complexity of the underlying pattern. More data supports more complex models. Complex patterns require more complex models.

Cross-validation estimates generalization performance for different complexities. The complexity with the best cross-validated performance is usually the one to choose.
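As an illustration of selecting complexity by cross-validation, here is a minimal sketch assuming NumPy, an invented toy dataset, and polynomial degree as the complexity knob. The function `cv_mse` and the choice of 5 folds are illustrative, not a prescribed procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=60)  # assumed toy data

k = 5
# Shuffle the indices once and reuse the same folds for every degree,
# so that all candidate complexities are compared on identical splits.
folds = np.array_split(rng.permutation(len(x)), k)

def cv_mse(degree):
    """Mean validation MSE of a polynomial of this degree over the k folds."""
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[val_idx])
        scores.append(np.mean((pred - y[val_idx]) ** 2))
    return float(np.mean(scores))

degrees = list(range(1, 11))
cv_scores = {d: cv_mse(d) for d in degrees}
best_degree = min(cv_scores, key=cv_scores.get)

for d in degrees:
    print(f"degree {d:2d}: CV MSE {cv_scores[d]:.4f}")
print("selected degree:", best_degree)
```

The selected degree depends on the random split, so in practice it is worth checking whether neighboring complexities score almost as well before committing to one.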

Visualizing the Tradeoff

Visualizations help understand the bias-variance tradeoff.

Learning Curves

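Learning curves plot performance versus training set size. As the training set grows, test performance typically improves, while training performance typically gets somewhat worse because a fixed model can no longer fit every training point.

The gap between the training and test curves indicates overfitting: a large gap means high variance. When both curves converge to a high error (poor performance), the model is underfitting.

The curves plateau once more data stops helping, which shows how much additional data can be expected to achieve.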
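A learning curve can be computed directly by refitting the same model on training sets of increasing size. The sketch below assumes NumPy, an invented data source, and a fixed degree-5 polynomial; it reports error (lower is better) and averages over repeated draws to smooth the curves. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
noise, degree, repeats = 0.3, 5, 30

def sample(n):
    """Draw n noisy observations of an assumed true function sin(3x)."""
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(3 * x) + rng.normal(scale=noise, size=n)

x_val, y_val = sample(500)  # fixed held-out set
sizes = [10, 20, 40, 80, 160, 320]
train_curve, val_curve = [], []

for n in sizes:
    train_errs, val_errs = [], []
    for _ in range(repeats):
        x_tr, y_tr = sample(n)
        coeffs = np.polyfit(x_tr, y_tr, degree)
        train_errs.append(np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2))
        val_errs.append(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
    train_curve.append(float(np.mean(train_errs)))
    val_curve.append(float(np.mean(val_errs)))

for n, tr, va in zip(sizes, train_curve, val_curve):
    print(f"n={n:4d}  train MSE {tr:.3f}  held-out MSE {va:.3f}")
```

With small training sets the held-out error should sit well above the training error; as the size grows, training error rises, held-out error falls, and the two approach each other.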

Validation Curves

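Validation curves plot performance versus a hyperparameter that controls model complexity, such as tree depth or regularization strength.

When validation error is plotted, the curve is typically U-shaped and shows the tradeoff: error is high at low complexity (high bias), high again at high complexity (high variance), and lowest somewhere in between. Note that a larger regularization strength means lower complexity, so that axis runs in the opposite direction.

The minimum of the validation error curve identifies the optimal complexity.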

Regularization

Regularization adds a penalty for complexity to the training objective. This constrains the model and reduces overfitting.

L1 Regularization (LASSO)

The L1 penalty adds the sum of absolute coefficient values to the objective. This can shrink coefficients to exactly zero.

LASSO performs variable selection automatically. Features with zero coefficients are dropped.

The penalty strength controls the amount of shrinkage. It is usually selected by cross-validation.
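To show how an L1 penalty produces exact zeros, here is a minimal coordinate-descent sketch. It assumes NumPy, the objective (1/2n)·||y − Xw||² + α·||w||₁ with no intercept, and an invented dataset in which only two of five features matter. The function names are illustrative; this is not a substitute for a library solver.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping at exactly zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimize (1/2n)||y - Xw||^2 + alpha * ||w||_1 one coordinate at a time."""
    n, p = X.shape
    w = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iters):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n
            w[j] = soft_threshold(rho, alpha) / col_scale[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])  # assumed ground truth
y = X @ true_w + rng.normal(scale=0.1, size=100)

w_lasso = lasso_coordinate_descent(X, y, alpha=0.2)
print(w_lasso)
```

The soft-thresholding step is what sets small coefficients to exactly zero; the surviving coefficients are also shrunk toward zero by roughly the penalty strength.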

L2 Regularization (Ridge)

The L2 penalty adds the sum of squared coefficient values to the objective. This shrinks coefficients toward zero but rarely to exactly zero.

Ridge handles correlated features well. It doesn't drop features but reduces their influence.

The penalty strength controls the amount of shrinkage. It is usually selected by cross-validation.
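The behavior of ridge on correlated features can be seen with its closed-form solution. The sketch below assumes NumPy, no intercept, and an invented dataset in which the second feature is nearly a copy of the first; `ridge_fit` is an illustrative helper name.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y, no intercept."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.1, size=n)

w_ols = ridge_fit(X, y, lam=0.0)    # no penalty: ordinary least squares
w_ridge = ridge_fit(X, y, lam=1.0)  # small L2 penalty

print("no penalty:", w_ols)
print("ridge:     ", w_ridge)
```

Without a penalty the two coefficients are poorly determined because the features are almost interchangeable. With the penalty, ridge should split the weight roughly evenly between them, keeping both features while reducing the influence of each.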

Elastic Net

Elastic net combines the L1 and L2 penalties, pairing the variable selection of L1 with the coefficient shrinkage of L2.

This is useful when features are correlated. LASSO tends to select one feature from a correlated group more or less arbitrarily, whereas elastic net tends to keep the group together.

The mixing parameter controls the balance between L1 and L2.
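One common way to write the elastic net objective is shown below. The symbols are assumptions introduced for illustration: α is the overall penalty strength, ρ in [0, 1] is the mixing parameter, n is the number of samples, and w is the coefficient vector. Other parameterizations exist.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  + \alpha \left( \rho \, \lVert w \rVert_1
  + \frac{1 - \rho}{2} \, \lVert w \rVert_2^2 \right)
```

Setting ρ = 1 recovers the LASSO penalty and ρ = 0 recovers the ridge penalty; intermediate values blend the two.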

Dropout

Dropout is a regularization technique for neural networks. It randomly drops units during training. This prevents co-adaptation of units.

Dropout acts like an ensemble of different networks. Each training iteration trains a different sub-network.

The dropout rate controls the fraction of units dropped. Higher rates provide more regularization.
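The mechanics can be sketched in a few lines of NumPy. This example uses the widely used "inverted dropout" variant, which is an assumption beyond the text: surviving activations are scaled up during training so that nothing needs to change at inference time. The function name and signature are illustrative.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units while training."""
    if not training or rate == 0.0:
        return activations  # inference: pass activations through unchanged
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True for units kept
    # Scale kept units by 1 / keep_prob so the expected activation is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))

dropped = dropout(activations, rate=0.5, rng=rng)
at_inference = dropout(activations, rate=0.5, rng=rng, training=False)

print("fraction dropped:", (dropped == 0).mean())
print("mean activation after dropout:", dropped.mean())
```

Because a fresh mask is drawn on every call, each training step effectively updates a different sub-network.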

Early Stopping

Early stopping monitors validation performance during training and stops when it starts degrading.

Implementation

Monitor validation performance after each epoch (or batch). Stop when validation performance stops improving.

The patience parameter controls how many epochs without improvement to tolerate before stopping. This keeps small fluctuations from ending training prematurely.

Restore the model weights from the best epoch. This captures the best-performing model.
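The three steps above (monitor, wait for `patience` epochs, restore the best weights) can be sketched in a framework-independent way. Everything here is hypothetical scaffolding: `train_step` and `validate` stand in for a real training loop and validation pass, and the toy problem exists only to make the example runnable.

```python
import copy

def train_with_early_stopping(train_step, validate, weights,
                              max_epochs=100, patience=3):
    """Train until validation loss fails to improve for `patience` epochs."""
    best_loss = float("inf")
    best_weights = copy.deepcopy(weights)
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        weights = train_step(weights)
        val_loss = validate(weights)
        if val_loss < best_loss:
            # New best: snapshot the weights and reset the patience counter.
            best_loss = val_loss
            best_weights = copy.deepcopy(weights)
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stopped improving

    # Return the snapshot from the best epoch, not the final weights.
    return best_weights, best_epoch, best_loss

# Toy stand-ins: each "epoch" moves w by 1, and validation loss is lowest at w = 5.
def toy_train_step(weights):
    return {"w": weights["w"] + 1.0}

def toy_validate(weights):
    return (weights["w"] - 5.0) ** 2

best_weights, best_epoch, best_loss = train_with_early_stopping(
    toy_train_step, toy_validate, {"w": 0.0}, max_epochs=100, patience=3
)
print(best_weights, best_epoch, best_loss)
```

In the toy run, validation loss bottoms out at epoch index 4 (w = 5.0); training continues for three more epochs without improvement, then stops and returns the epoch-4 snapshot.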

When It Helps

Early stopping prevents overfitting when training longer leads to degradation. This is common with neural networks.

It also saves computation by ending training once further epochs no longer help. The cost is that it requires a held-out validation set.

Not all problems show validation degradation. Some models keep improving, or simply plateau, for as long as they are trained.

Data Augmentation

Data augmentation increases effective training data size by creating modified versions of existing data.

Image Augmentation

Random rotations, flips, crops, and color changes create modified images. The label stays the same.

This teaches invariance to transformations. The model learns that rotated cats are still cats.

Excessive augmentation might create unrealistic images. Appropriate augmentation depends on the problem.
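A minimal sketch of two of the transformations mentioned above, a random horizontal flip and a random crop, using only NumPy. It assumes images are arrays of shape (height, width, channels); the function name `augment` and the crop size are illustrative choices.

```python
import numpy as np

def augment(image, rng, crop_size):
    """Return a randomly flipped and randomly cropped copy of an (H, W, C) image."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # horizontal flip (reverse the width axis)
    h, w, _ = image.shape
    # rng.integers excludes the upper bound, so the crop always fits.
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :].copy()

rng = np.random.default_rng(0)
image = np.arange(32 * 32 * 3).reshape(32, 32, 3)  # stand-in for a real image
label = "cat"                                      # the label is not modified

patch = augment(image, rng, crop_size=28)
print(patch.shape, label)
```

Calling `augment` repeatedly on the same image yields different views that all share the original label. Whether flips or crops are appropriate depends on the task; a horizontal flip, for instance, would change the meaning of images containing text.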

Text Augmentation

Synonym replacement, random insertion, and back-translation create modified text. This increases training data.

This is more challenging than image augmentation. Text transformations must preserve meaning.

Synonyms might change meaning. Back-translation might distort content.
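Synonym replacement can be sketched with the standard library alone. The synonym table below is a tiny hypothetical stand-in for a real thesaurus, and the function name and `prob` parameter are illustrative.

```python
import random

# Hypothetical synonym table; a real system would use a curated thesaurus.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
}

def synonym_replace(sentence, rng, prob=0.5):
    """Replace each word that has a known synonym with probability `prob`."""
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

rng = random.Random(0)
original = "The quick dog looks happy"
augmented = synonym_replace(original, rng, prob=1.0)
unchanged = synonym_replace(original, rng, prob=0.0)
print(augmented)
```

Even this simple scheme shows the risk noted above: a table entry that is a synonym in one context may change the meaning in another, so replacements are best reviewed or restricted to unambiguous words.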

Other Domains

Audio augmentation adds noise, changes speed, or shifts pitch. Tabular data can be augmented by adding noise or generating synthetic data.

Augmentation is domain-specific. What works in one domain might not transfer to others.

Key Takeaways

Underfitting occurs with overly simple models; both training and test performance are poor
Overfitting occurs with overly complex models; training performance is good but test performance is poor
The bias-variance tradeoff balances model complexity against generalization
Regularization (L1, L2, dropout) reduces overfitting by constraining model flexibility
Early stopping prevents overfitting by stopping when validation performance degrades
Data augmentation increases effective training data size through transformations
