Understanding Model Fit
Model fit describes how well a model captures patterns in training data. The fundamental challenge is that training data is a sample from a larger population. A model that fits training data perfectly might not generalize to new data.
The relationship between model complexity and generalization is central to machine learning. Models that are too simple fail to capture important patterns (underfitting). Models that are too complex fit noise rather than signal (overfitting).
Understanding and managing this tradeoff is essential for building models that generalize well.
Underfitting
Underfitting occurs when models are too simple to capture patterns in the data. Both training and test performance are poor.
Symptoms of Underfitting
Training performance is poor. The model fails to learn basic patterns in the data. Errors are high even on training data.
Test performance is similar to training performance. Both are poor because the model cannot capture the underlying structure.
The model has high bias. It makes strong assumptions that are wrong for the data.
Causes of Underfitting
Insufficient features provide little information for prediction. The model cannot learn relationships that depend on missing variables.
Excessive regularization prevents the model from fitting patterns. Strong penalties might constrain the model too much.
Insufficient model capacity prevents capturing complex patterns. Linear models might be too simple for non-linear relationships.
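Insufficient training prevents the model from learning. Too few training iterations leave the model short of convergence. A very small dataset can also fail to expose the pattern at all, although scarce data more often leads to overfitting.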
Addressing Underfitting
Add more informative features. Feature engineering might create predictive variables that the model can use.
Reduce regularization strength. A weaker penalty allows more complex patterns to be fit.
Increase model complexity. More flexible models can capture more patterns: for example, more layers or units in neural networks, or deeper trees in forests. (Adding more trees to a forest mainly reduces variance rather than bias.)
Train longer. More iterations allow better convergence.
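As a minimal sketch of the symptoms above and of one remedy (increasing model complexity), the following fits a straight line and a quadratic to the same data. The quadratic signal, the noise level, and the helper names `make_data` and `mse` are assumptions chosen for illustration, not part of any particular workflow.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data: a quadratic signal plus a little Gaussian noise.
    x = rng.uniform(-3, 3, size=n)
    y = x**2 + rng.normal(scale=0.5, size=n)
    return x, y

def mse(coeffs, x, y):
    # Mean squared error of a polynomial with the given coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(200)
x_test, y_test = make_data(200)

# A straight line (degree 1) cannot represent the curvature: it underfits.
line = np.polyfit(x_train, y_train, deg=1)
# A quadratic (degree 2) has enough capacity for this signal.
quad = np.polyfit(x_train, y_train, deg=2)

line_train, line_test = mse(line, x_train, y_train), mse(line, x_test, y_test)
quad_train, quad_test = mse(quad, x_train, y_train), mse(quad, x_test, y_test)

print(f"degree 1: train MSE {line_train:.2f}, test MSE {line_test:.2f}")
print(f"degree 2: train MSE {quad_train:.2f}, test MSE {quad_test:.2f}")
```

The degree-1 model should show the underfitting signature: high error on both splits, with little gap between them. The degree-2 model should bring both errors down toward the noise level.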
Overfitting
Overfitting occurs when models capture noise rather than signal. Training performance is excellent but test performance is poor.
Symptoms of Overfitting
Training performance is excellent. Errors are very low. The model appears to fit perfectly.
Test performance is much worse than training performance. The gap between the two indicates overfitting.
The model has high variance. Small changes in training data lead to very different models.
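These symptoms can be reproduced with a deliberately over-flexible model. The sketch below assumes a sine-shaped signal and fits a degree-9 polynomial to only 10 noisy points, so the model has as many coefficients as there are training examples. The signal, noise level, and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def signal(x):
    # Hypothetical smooth ground truth chosen for illustration.
    return np.sin(np.pi * x)

# Only 10 noisy training points, but a polynomial with 10 coefficients.
x_train = np.linspace(-1, 1, 10)
y_train = signal(x_train) + rng.normal(scale=0.3, size=x_train.size)

x_test = rng.uniform(-1, 1, size=500)
y_test = signal(x_test) + rng.normal(scale=0.3, size=x_test.size)

# Degree 9 passes through every training point, noise included.
coeffs = np.polyfit(x_train, y_train, deg=9)

train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))

print(f"train MSE: {train_mse:.2e}")
print(f"test MSE:  {test_mse:.2e}")
```

The training error is essentially zero because the polynomial interpolates the noisy points, while the test error stays at or above the noise level. Re-running with a different seed changes the fitted curve substantially, which is the high-variance behavior described above.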
Causes of Overfitting
Excessive model complexity fits noise rather than signal. The model has more parameters than the data can support.
Insufficient regularization allows too much flexibility. Without a penalty, a high-capacity model is free to fit the idiosyncrasies of the training sample.
Insufficient training data doesn't constrain model complexity. Having more parameters than data points is a warning sign, though not proof of overfitting on its own.
Training for too long can fit noise: training error keeps falling while validation error starts to rise. Early stopping can help.
Addressing Overfitting
Reduce model complexity. Simpler models are less likely to fit noise: use fewer parameters, shallower trees, or fewer layers.
Increase regularization. Stronger penalties constrain flexibility. Common choices are L1 and L2 penalties and, for neural networks, dropout.
Add training data. More data provides more constraints on what the model can learn.
Use validation data to stop training early. Monitor validation performance and stop when it starts degrading.
The Bias-Variance Tradeoff
The bias-variance tradeoff describes the relationship between model complexity and generalization.
Bias
Bias is error from overly simplistic models. High-bias models make strong assumptions that might be wrong. They underfit.
High bias shows as poor training performance. The model cannot capture patterns. Simple linear models on non-linear data have high bias.
Reducing bias typically requires more complex models. This might increase variance.
Variance
Variance is error from models being too sensitive to training data. High-variance models change substantially with different training data. They overfit.
High variance shows as large differences between training and test performance. The model fits training data almost perfectly but fails on new data.
Reducing variance typically requires simpler models or more data. This might increase bias.
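For squared-error loss, the two terms can be written down explicitly. The notation here is an assumption introduced for illustration: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat f_D$ is the model fitted on a randomly drawn training set $D$. Under those assumptions, the expected error at a point $x$ decomposes as:

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat f_D(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}_D[\hat f_D(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The bias term measures how far the average fitted model is from the truth. The variance term measures how much individual fitted models scatter around that average. The noise term cannot be reduced by any choice of model.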
Optimal Complexity
The optimal model complexity balances bias and variance. It is neither too simple (high bias) nor too complex (high variance).
The optimum depends on the quantity of data and the complexity of the underlying pattern. More data supports more complex models. Complex patterns require more complex models.
Cross-validation estimates generalization performance for different complexities. The complexity with the best validation performance is the usual choice.
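A minimal sketch of this selection procedure, assuming polynomial degree as the complexity knob and k-fold cross-validation as the estimator. The function name `kfold_mse`, its `k` and `seed` parameters, and the synthetic quadratic data are hypothetical choices for illustration.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average held-out MSE of a degree-`degree` polynomial over k folds."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        # Train on every fold except the i-th, which is held out.
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], deg=degree)
        scores.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=120)
y = x**2 + rng.normal(scale=0.5, size=120)  # hypothetical quadratic data

# Score each candidate complexity and keep the one with the lowest CV error.
cv_scores = {d: kfold_mse(x, y, d) for d in range(1, 9)}
best_degree = min(cv_scores, key=cv_scores.get)

for d, score in cv_scores.items():
    print(f"degree {d}: CV MSE {score:.3f}")
print("selected degree:", best_degree)
```

Degree 1 should score far worse than the others because it underfits. Differences among the higher degrees are often small, so the selected degree can vary with the random split.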
Visualizing the Tradeoff
Two kinds of plot help diagnose where a model sits on the bias-variance tradeoff.
Learning Curves
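Learning curves plot performance versus training set size. As the training set grows, test performance typically improves, while training performance typically gets somewhat worse because a larger sample is harder to fit exactly.
The gap between the training and test curves shows overfitting. A large gap indicates high variance. Two curves that converge at a high error indicate underfitting.
The curves plateau when more data no longer helps. The plateau shows the most that additional data can achieve for this model.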
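The numbers behind a learning curve can be computed with a short loop. This sketch assumes a degree-5 polynomial fitted to synthetic quadratic data and averages over several random training sets of each size to smooth out sampling noise. The sizes, degree, and repeat count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical quadratic signal with Gaussian noise (variance 0.25).
    x = rng.uniform(-3, 3, size=n)
    return x, x**2 + rng.normal(scale=0.5, size=n)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_test, y_test = make_data(2000)
sizes = [10, 30, 100, 300, 1000]
degree, repeats = 5, 20

curve = []  # one (size, mean train MSE, mean test MSE) tuple per size
for n in sizes:
    train_err, test_err = [], []
    for _ in range(repeats):
        x_tr, y_tr = make_data(n)
        coeffs = np.polyfit(x_tr, y_tr, deg=degree)
        train_err.append(mse(coeffs, x_tr, y_tr))
        test_err.append(mse(coeffs, x_test, y_test))
    curve.append((n, float(np.mean(train_err)), float(np.mean(test_err))))

for n, tr, te in curve:
    print(f"n={n:5d}  train MSE={tr:.3f}  test MSE={te:.3f}  gap={te - tr:.3f}")
```

With few examples, the training error sits below the noise level and the test error sits well above it. As the training set grows, the two should converge toward the noise variance, which is the plateau.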
Validation Curves
Validation curves plot performance versus a model complexity parameter (such as tree depth or regularization strength; note that stronger regularization means lower effective complexity).
A U-shaped validation error curve shows the tradeoff: error is high at low complexity (high bias), high again at high complexity (high variance), and lowest in between.
The minimum of the validation error curve identifies the optimal complexity.
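A validation curve can be tabulated by sweeping the complexity parameter and recording both errors. The sketch below assumes polynomial degree as the parameter, a single held-out validation set, and a synthetic sine-shaped signal. All three are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical non-linear signal with Gaussian noise.
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_train, y_train = make_data(40)
x_val, y_val = make_data(200)

degrees = list(range(1, 13))
train_err, val_err = [], []
for degree in degrees:
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err.append(float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)))
    val_err.append(float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)))

# The complexity with the lowest validation error is the selected one.
best_degree = degrees[int(np.argmin(val_err))]
for d, tr, va in zip(degrees, train_err, val_err):
    marker = "  <- lowest validation error" if d == best_degree else ""
    print(f"degree {d:2d}: train MSE {tr:.3f}, validation MSE {va:.3f}{marker}")
```

Training error keeps falling as the degree increases, because each higher-degree model contains the lower-degree ones. Validation error typically falls at first and then levels off or rises, and its minimum marks the selected complexity.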
Regularization
Regularization adds a penalty for complexity to the training objective. This constrains models and reduces overfitting.
L1 Regularization (LASSO)
The L1 penalty adds the sum of absolute coefficient values to the objective. This can shrink coefficients to exactly zero.
LASSO therefore performs variable selection automatically. Features with zero coefficients are effectively dropped from the model.
The penalty strength controls the amount of shrinkage. Cross-validation is commonly used to select it.
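To show where the exact zeros come from, here is a minimal coordinate-descent sketch for the objective (1/(2n))·||y − Xw||² + alpha·||w||₁. It is one standard way to solve LASSO, not necessarily what any given library does internally. The function names, the fixed iteration count, and the synthetic data are assumptions, and the intercept is omitted for brevity.

```python
import numpy as np

def soft_threshold(value, threshold):
    # Shrinks toward zero and returns exactly 0 when |value| <= threshold.
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iters):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])  # only two informative features
y = X @ true_w + rng.normal(scale=0.1, size=200)

w_lasso = lasso_coordinate_descent(X, y, alpha=0.2)
print(np.round(w_lasso, 3))
```

The soft-threshold step is what produces exact zeros: any feature whose correlation with the residual is smaller than `alpha` in magnitude gets a coefficient of exactly zero. The two informative coefficients survive but are shrunk toward zero.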
L2 Regularization (Ridge)
The L2 penalty adds the sum of squared coefficient values to the objective. This shrinks coefficients toward zero but rarely to exactly zero.
Ridge handles correlated features well. It doesn't drop features but reduces their influence.
The penalty strength controls the amount of shrinkage. Cross-validation is commonly used to select it.
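For linear regression, the ridge solution has a closed form, which makes the shrinkage easy to inspect. The sketch below assumes the objective ||y − Xw||² + lam·||w||² with no intercept. The function name `ridge_fit` and the synthetic data with two nearly identical features are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 (no intercept, for brevity)."""
    p = X.shape[1]
    # Closed form: w = (X^T X + lam * I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
base = rng.normal(size=200)
# Two nearly identical (highly correlated) features plus one independent feature.
X = np.column_stack([
    base,
    base + rng.normal(scale=0.01, size=200),
    rng.normal(size=200),
])
y = 2.0 * base + X[:, 2] + rng.normal(scale=0.1, size=200)

# Coefficient norm for increasing penalty strengths.
norms = {lam: float(np.linalg.norm(ridge_fit(X, y, lam)))
         for lam in [0.0, 1.0, 10.0, 100.0]}
w_ridge = ridge_fit(X, y, 10.0)

print(norms)
print(np.round(w_ridge, 3))
```

The coefficient norm shrinks as `lam` grows, and no coefficient is driven to exactly zero. With the penalty in place, the two correlated features receive nearly equal weights instead of one feature absorbing the whole effect.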
Elastic Net
Elastic net combines the L1 and L2 penalties, pairing variable selection (L1) with coefficient shrinkage (L2).
This is useful when features are correlated. LASSO might arbitrarily select one feature from a correlated group; elastic net tends to keep the group together.
The mixing parameter controls the balance between L1 and L2.
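One common way to write the combined objective for linear regression is shown below. The symbols are assumptions introduced for illustration: $\lambda \ge 0$ is the overall penalty strength and $\alpha \in [0, 1]$ is the mixing parameter. Other parametrizations exist and differ only in how the constants are arranged.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \lambda \left( \alpha\,\lVert w \rVert_1
  \;+\; \frac{1-\alpha}{2}\,\lVert w \rVert_2^2 \right)
```

Setting $\alpha = 1$ recovers the LASSO penalty, and setting $\alpha = 0$ recovers the ridge penalty. Values in between trade sparsity against the grouping behavior on correlated features.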
Dropout
Dropout is a regularization technique for neural networks. It randomly drops units during training. This prevents co-adaptation of units.
Dropout acts like an ensemble of different networks. Each training iteration trains a different sub-network.
The dropout rate controls the fraction of units dropped. Higher rates provide more regularization. At inference time no units are dropped.
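The mechanism fits in a few lines. This sketch uses the "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at inference time. The function signature and the array of ones standing in for hidden activations are illustrative assumptions.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # inference: all units are kept, no scaling needed
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 100))  # stand-in for a layer's activations
dropped = dropout(hidden, rate=0.5, rng=rng)

print("fraction zeroed:", (dropped == 0).mean())
print("mean activation:", dropped.mean())
```

A fresh mask is drawn on every call, so each training step sees a different sub-network. Because of the rescaling, the mean activation stays close to its original value even though about half the units are zeroed.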
Early Stopping
Early stopping monitors validation performance during training and stops when it starts degrading.
Implementation
Monitor validation performance after each epoch (or batch). Stop when validation performance stops improving.
The patience parameter controls how many epochs of no improvement to wait. This allows for small fluctuations.
Restore the model weights from the best epoch. This captures the best-performing model.
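The steps above can be sketched as a small function over a list of per-epoch validation losses. The function name `early_stopping` and the loss values are hypothetical. A real training loop would compute the loss each epoch and save a checkpoint of the weights wherever the comment indicates.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
            # A real loop would checkpoint the model weights here.
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # stop; restore weights from best_epoch
    return best_epoch, len(val_losses) - 1  # never triggered; ran to the end

# Hypothetical validation losses: improving, then degrading.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.49, 0.51, 0.53, 0.54, 0.60]
best_epoch, stop_epoch = early_stopping(losses, patience=3)
print(best_epoch, stop_epoch)  # prints: 5 8
```

With `patience=3`, the small bump at epoch 4 is tolerated, the improvement at epoch 5 resets the counter, and training stops at epoch 8 after three epochs without improvement. With `patience=1`, the same sequence would stop at epoch 4 and miss the better model at epoch 5.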
When It Helps
Early stopping prevents overfitting when training longer leads to degradation. This is common with neural networks.
It also saves computation by stopping before training would otherwise finish. The cost is that it requires a held-out validation set.
Not all problems show validation degradation. Some models keep improving, or simply plateau, with more training.
Data Augmentation
Data augmentation increases effective training data size by creating modified versions of existing data.
Image Augmentation
Random rotations, flips, crops, and color changes create modified images. The label stays the same.
This teaches invariance to transformations. The model learns that rotated cats are still cats.
Excessive augmentation might create unrealistic images. Appropriate augmentation depends on the problem.
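A minimal sketch of two of these transformations, a random horizontal flip and a random crop, using plain NumPy arrays in height × width × channel layout. The function name, the crop size, and the random array standing in for a real image are illustrative assumptions.

```python
import numpy as np

def augment_image(image, rng, crop_size):
    """Random horizontal flip followed by a random crop; the label is unchanged."""
    if rng.random() < 0.5:
        image = image[:, ::-1]  # flip left-right (axis 1 is width)
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size]

rng = np.random.default_rng(0)
# Stand-in for a real 32x32 RGB photo.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = [augment_image(image, rng, crop_size=28) for _ in range(4)]

print([a.shape for a in augmented])
```

Each call yields a slightly different view of the same image, and all views share the original label. Whether a given transformation is appropriate depends on the task: a horizontal flip is harmless for most animal photos but would change the meaning of a handwritten digit or a street sign.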
Text Augmentation
Synonym replacement, random insertion, and back-translation (translating text into another language and back) create modified versions of existing text.
This is more challenging than image augmentation because text transformations must preserve meaning.
A synonym can shift the meaning in context, and back-translation can distort content.
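Synonym replacement, the simplest of these, can be sketched with the standard library alone. The synonym table here is hypothetical and hand-written. In practice the table is the hard part, since a replacement that is valid in one context can be wrong in another.

```python
import random

# Hypothetical, hand-written synonym table for illustration only.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "big": ["large"],
}

def synonym_replace(sentence, prob, rng):
    """Replace each word that has known synonyms with probability `prob`."""
    words = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            words.append(rng.choice(options))
        else:
            words.append(word)  # words without a table entry are left alone
    return " ".join(words)

rng = random.Random(0)
original = "the quick dog looks happy"
variants = [synonym_replace(original, prob=0.5, rng=rng) for _ in range(3)]

for variant in variants:
    print(variant)
```

Keeping `prob` low limits how far a variant drifts from the original. The label is assumed to carry over unchanged, which only holds when the replacement really does preserve meaning.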
Other Domains
Audio augmentation adds noise, changes speed, or shifts pitch. Tabular data can be augmented by adding noise or generating synthetic data.
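Noise injection for tabular data can be sketched as follows. Scaling the noise by each column's standard deviation is one reasonable choice assumed here, so that columns on different scales are perturbed proportionally. The function name, the `relative_scale` parameter, and the synthetic table are illustrative.

```python
import numpy as np

def add_gaussian_noise(X, rng, relative_scale=0.05):
    """Jitter each numeric column with noise proportional to its standard deviation."""
    column_std = X.std(axis=0, keepdims=True)
    noise = rng.normal(size=X.shape) * column_std * relative_scale
    return X + noise

rng = np.random.default_rng(0)
# Hypothetical numeric table: 100 rows, two columns on very different scales.
X = np.column_stack([
    rng.normal(50, 10, size=100),
    rng.normal(0.5, 0.1, size=100),
])

# Original rows followed by one jittered copy of each row.
X_aug = np.vstack([X, add_gaussian_noise(X, rng)])
print(X.shape, "->", X_aug.shape)
```

This only makes sense for continuous columns. Categorical codes, identifiers, and columns bound by hard constraints (such as counts that must stay non-negative integers) need different handling.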
Augmentation is domain-specific. What works in one domain might not transfer to others.
Key Takeaways
- Underfitting occurs with overly simple models; both training and test performance are poor
- Overfitting occurs with overly complex models; training performance is good but test performance is poor
- The bias-variance tradeoff balances error from overly simple models (bias) against error from sensitivity to the training data (variance)
- Regularization (L1, L2, dropout) reduces overfitting by constraining model flexibility
- Early stopping prevents overfitting by stopping when validation performance degrades
- Data augmentation increases effective training data size through transformations