ML Optimization
Optimization finds the parameters θ that minimize a loss function J(θ).
Gradient Descent
Update rule: θ ← θ - α∇J(θ), where ∇J(θ) is the gradient of the loss. The learning rate α controls the step size.
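A minimal sketch of the update rule, assuming an illustrative one-parameter loss J(θ) = (θ - 3)^2 that is not from the notes. The function names and the values of α are hypothetical and chosen only to show how the step size changes the outcome.

```python
def grad_J(theta):
    # Gradient of the illustrative loss J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, alpha, steps):
    # Repeatedly apply theta <- theta - alpha * grad J(theta).
    for _ in range(steps):
        theta = theta - alpha * grad_J(theta)
    return theta

# A moderate learning rate moves theta toward the minimizer at 3.
theta_final = gradient_descent(theta=0.0, alpha=0.1, steps=100)

# A learning rate that is too large overshoots further on every step.
theta_diverged = gradient_descent(theta=0.0, alpha=1.1, steps=50)

print(theta_final, theta_diverged)
```

With α = 0.1 each step shrinks the distance to the minimizer, while with α = 1.1 each step grows it, so the second run moves away from 3.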
Batch GD: computes the gradient over all the data at each step. Stochastic GD: uses one sample per step. Mini-batch GD: uses a small batch of samples per step.
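The three variants differ only in how many samples feed each gradient estimate. The sketch below assumes a toy, noiseless linear model y = 2x + 1 with mean squared error. The data, function names, and hyperparameters are illustrative assumptions, not part of the notes.

```python
import random

def make_data(n, seed=0):
    # Noiseless samples from the assumed target y = 2x + 1.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0) for x in xs]

def mse_gradient(w, b, batch):
    # Gradient of the mean squared error of w*x + b over one batch.
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    return gw / len(batch), gb / len(batch)

def train(data, batch_size, alpha=0.1, epochs=300, seed=0):
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        # One parameter update per batch of the shuffled data.
        for i in range(0, len(data), batch_size):
            gw, gb = mse_gradient(w, b, data[i:i + batch_size])
            w -= alpha * gw
            b -= alpha * gb
    return w, b

data = make_data(64)
w_batch, b_batch = train(data, batch_size=len(data))  # batch GD
w_sgd, b_sgd = train(data, batch_size=1)              # stochastic GD
w_mini, b_mini = train(data, batch_size=8)            # mini-batch GD
```

Batch GD makes one update per epoch, while the stochastic run makes 64. Smaller batches give cheaper but noisier updates.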
Adaptive Methods
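Adam: combines momentum (a running average of gradients) with per-parameter adaptive learning rates. RMSprop: divides each gradient by the root of a running average of its squared values.
Adam is often a good default choice. Learning rate scheduling: decrease α over the course of training.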
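A sketch of the standard Adam update combined with a simple decay schedule. It assumes an illustrative two-parameter quadratic loss with its minimum at (3, -1). The loss, the inverse-time schedule, and the helper names are assumptions for demonstration. The values of beta1, beta2, and eps are the commonly used defaults.

```python
import math

def grad_J(theta):
    # Gradient of the illustrative loss
    # J(theta) = (theta[0] - 3)^2 + 10 * (theta[1] + 1)^2.
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]

def inverse_time_decay(alpha0, t, k=0.01):
    # One possible schedule: alpha shrinks as the step count t grows.
    return alpha0 / (1.0 + k * t)

def adam(theta, alpha0=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    theta = list(theta)
    m = [0.0] * len(theta)  # running average of gradients (momentum)
    v = [0.0] * len(theta)  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_J(theta)
        alpha = inverse_time_decay(alpha0, t)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            # Bias correction compensates for the zero initialization.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            # Per-parameter step: large past gradients shrink the step.
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

theta_adam = adam([0.0, 0.0])
print(theta_adam)
```

Dropping the `m` term and dividing the raw gradient by `sqrt(v) + eps` gives an RMSprop-style update.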
Second-Order
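Newton's method: uses the Hessian (the matrix of second derivatives) to scale each step. L-BFGS: a limited-memory quasi-Newton method that approximates the Hessian information from recent gradients.
Each step is more expensive, but convergence usually takes fewer iterations. Not always better in practice, since computing or storing the Hessian becomes impractical for models with many parameters.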
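To illustrate the trade-off, the sketch below compares Newton's method with plain gradient descent on an assumed one-dimensional loss J(θ) = θ^2 + exp(θ), where the Hessian reduces to the second derivative. The loss, tolerance, and learning rate are illustrative choices, not from the notes.

```python
import math

def d1(theta):
    # First derivative of the illustrative loss J(theta) = theta^2 + exp(theta).
    return 2.0 * theta + math.exp(theta)

def d2(theta):
    # Second derivative (the 1-D Hessian); always positive for this loss.
    return 2.0 + math.exp(theta)

def minimize(step, theta=0.0, tol=1e-10, max_iters=10_000):
    # Apply `step` until the gradient magnitude falls below `tol`.
    iters = 0
    while abs(d1(theta)) > tol and iters < max_iters:
        theta = step(theta)
        iters += 1
    return theta, iters

# Newton step: scale the gradient by the inverse second derivative.
newton_theta, newton_iters = minimize(lambda th: th - d1(th) / d2(th))

# Gradient descent step with a fixed learning rate of 0.1.
gd_theta, gd_iters = minimize(lambda th: th - 0.1 * d1(th))

print(newton_iters, gd_iters)
```

On this small convex problem Newton's method should reach the tolerance in noticeably fewer iterations. In many dimensions each Newton step would also require forming and solving with the full Hessian, which is the cost that quasi-Newton methods such as L-BFGS try to avoid.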
Key Takeaways
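- Gradient descent is the basic optimization method: step against the gradient of the loss
- Adam is usually a good default optimizer
- The learning rate is critical: too small is slow, too large can diverge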