Introduction
Linear regression is a fundamental supervised learning algorithm that models the relationship between a target variable and one or more predictors as a linear equation.
Simple Linear Regression
$y = \beta_0 + \beta_1 x + \varepsilon$
Where:
- $y$ is the target variable
- $x$ is the predictor
- $\beta_0$ is the intercept
- $\beta_1$ is the slope
- $\varepsilon$ is the error term
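For a single predictor, the least-squares slope is the covariance of predictor and target divided by the variance of the predictor, and the intercept follows from the two means. The sketch below is one way to compute them with NumPy. It assumes ordinary least squares as the fitting method (the text does not name one) and reuses the square-footage and price values from this article's sample data; the variable names are illustrative.

```python
import numpy as np

# Square footage and price values taken from the article's sample data.
x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
y = np.array([200000, 280000, 350000, 420000, 480000], dtype=float)

# Closed-form least-squares estimates for one predictor:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
x_centered = x - x.mean()
y_centered = y - y.mean()
slope = (x_centered * y_centered).sum() / (x_centered ** 2).sum()
intercept = y.mean() - slope * x.mean()

# Residuals are the estimated error terms for each observation.
residuals = y - (intercept + slope * x)

print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
# prints: slope: 140.00, intercept: 66000.00
```

On this data, each additional square foot is associated with an estimated price increase of 140, and the residuals sum to zero, as they always do when an intercept is fitted.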
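Multiple Linear Regression
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$
With $p$ predictors $x_1, \dots, x_p$, each predictor $x_j$ has its own coefficient $\beta_j$; $\beta_0$ is the intercept and $\varepsilon$ is the error term.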
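With several predictors, the coefficients can be estimated by solving a least-squares problem on a design matrix whose first column is all ones (for the intercept). The sketch below assumes ordinary least squares and uses `np.linalg.lstsq` on the article's sample data, then compares the result with scikit-learn's `LinearRegression`. The names `A` and `beta` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample data from the article.
X = pd.DataFrame({
    'sqft': [1000, 1500, 2000, 2500, 3000],
    'bedrooms': [2, 3, 3, 4, 4],
    'age': [10, 5, 15, 8, 20],
})
y = np.array([200000, 280000, 350000, 420000, 480000], dtype=float)

# Design matrix: a leading column of ones carries the intercept.
A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])

# Least-squares solution of A @ beta ~ y; beta[0] is the intercept.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# scikit-learn fits the same model, so the estimates should agree.
model = LinearRegression().fit(X, y)
print("lstsq  :", beta)
print("sklearn:", model.intercept_, model.coef_)
```

Solving with `lstsq` rather than explicitly inverting a matrix is a deliberate choice here: it is numerically more stable when predictors are on very different scales or are strongly correlated.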
Python Implementation
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Sample data
X = pd.DataFrame({
    'sqft': [1000, 1500, 2000, 2500, 3000],
    'bedrooms': [2, 3, 3, 4, 4],
    'age': [10, 5, 15, 8, 20]
})
y = np.array([200000, 280000, 350000, 420000, 480000])
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Coefficients
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
# Predict
y_pred = model.predict(X_test)
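# Evaluate
# Note: with only five rows, test_size=0.2 leaves a single test sample.
# R2 is undefined for one sample (scikit-learn warns and returns NaN),
# so these metrics are only meaningful on a larger dataset.
print(f"R2 Score: {r2_score(y_test, y_pred)}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred))}")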
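To make the two evaluation metrics concrete, the sketch below computes RMSE and R2 by hand and checks them against scikit-learn. The arrays `y_true` and `y_hat` are hypothetical values made up purely for illustration; they do not come from the housing example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, made up purely for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.4f}, R2: {r2:.4f}")
# prints: RMSE: 0.4330, R2: 0.9625
```

RMSE is expressed in the units of the target, while R2 is unitless and compares the model against always predicting the mean of the targets.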
Assumptions of Linear Regression
- Linearity: Relationship between X and y is linear
- Independence: Observations are independent
- Homoscedasticity: Constant variance of residuals
- Normality: Residuals are normally distributed
- No multicollinearity: Predictors are not highly correlated with one another
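The multicollinearity assumption can be checked numerically with the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R2). The sketch below applies this to the article's sample predictors. `compute_vif` is a helper defined here for illustration, not a library function, and the common rule of thumb that values above roughly 5 to 10 are a concern is a convention rather than a strict threshold.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Predictors from the article's sample data.
X = pd.DataFrame({
    'sqft': [1000, 1500, 2000, 2500, 3000],
    'bedrooms': [2, 3, 3, 4, 4],
    'age': [10, 5, 15, 8, 20],
})

def compute_vif(features: pd.DataFrame) -> pd.Series:
    """Illustrative helper: VIF_j = 1 / (1 - R2_j), where R2_j comes from
    regressing column j on all the other columns (with an intercept)."""
    vifs = {}
    for col in features.columns:
        others = features.drop(columns=col)
        target = features[col]
        r2 = LinearRegression().fit(others, target).score(others, target)
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

vif = compute_vif(X)
print(vif)
```

A VIF of 1 means a predictor is uncorrelated with the rest; larger values mean its coefficient estimate is less stable.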
Checking Assumptions
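import matplotlib.pyplot as plt
from scipy import stats
# Residuals
# (the five-row sample leaves one test point; use a larger dataset
# for these plots to be informative)
residuals = y_test - y_pred
# Draw the two diagnostics on separate axes so they do not overlap
fig, (ax_resid, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))
# Residual plot: look for a patternless band around zero
ax_resid.scatter(y_pred, residuals)
ax_resid.axhline(y=0, color='r', linestyle='--')
ax_resid.set_xlabel('Predicted values')
ax_resid.set_ylabel('Residuals')
# Q-Q plot: points close to the line suggest normal residuals
stats.probplot(residuals, dist="norm", plot=ax_qq)
plt.tight_layout()
plt.show()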
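Plots can be complemented with simple numeric checks. The sketch below is one possible approach, not a standard recipe: it uses the Shapiro-Wilk test for normality of residuals and, as a rough heteroscedasticity check, the Spearman rank correlation between fitted values and absolute residuals. The data is synthetic and generated only for illustration, and all names ending in `_demo` are assumptions.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data, generated only for illustration (not from the article).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 2.0 + 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=100)

model_demo = LinearRegression().fit(X_demo, y_demo)
fitted_demo = model_demo.predict(X_demo)
residuals_demo = y_demo - fitted_demo

# Normality: Shapiro-Wilk tests the null hypothesis that the residuals
# come from a normal distribution. A small p-value is evidence against it.
shapiro_stat, shapiro_p = stats.shapiro(residuals_demo)

# Homoscedasticity (rough check): if the spread of residuals grows or
# shrinks with the fitted values, |residuals| will correlate with them.
spearman_rho, spearman_p = stats.spearmanr(fitted_demo, np.abs(residuals_demo))

print(f"Shapiro-Wilk: W={shapiro_stat:.3f}, p={shapiro_p:.3f}")
print(f"Spearman(|resid|, fitted): rho={spearman_rho:.3f}, p={spearman_p:.3f}")
```

Failing to reject a null hypothesis does not prove an assumption holds, so these tests are best read together with the residual and Q-Q plots.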
Regularized Linear Regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# alpha sets the penalty strength: larger values shrink coefficients more
# L2 Regularization (Ridge)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
# L1 Regularization (Lasso) - Feature selection
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)
# ElasticNet (L1 + L2)
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic.fit(X_train, y_train)
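The difference between L2 and L1 penalties is easiest to see on data where some features are irrelevant. The sketch below uses synthetic data (an assumption made for illustration, not the housing example) in which only the first two of five features affect the target. Features are standardized in a pipeline first, since the penalties depend on the scale of each feature. The `alpha` values are arbitrary choices for this demo.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, generated only for illustration.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 5
X_demo = rng.normal(size=(n_samples, n_features))
# Only the first two features influence the target.
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
          + rng.normal(scale=1.0, size=n_samples))

# Standardize, then fit each penalized model.
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_demo, y_demo)
lasso_pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.5)).fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
ridge_coef = ridge_pipe.named_steps['ridge'].coef_
lasso_coef = lasso_pipe.named_steps['lasso'].coef_

print("Ridge coefficients:", np.round(ridge_coef, 3))
print("Lasso coefficients:", np.round(lasso_coef, 3))
```

In this setup Ridge shrinks all five coefficients but keeps them non-zero, while Lasso sets the coefficients of the three uninformative features to exactly zero. In practice `alpha` is usually chosen by cross-validation rather than fixed by hand.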
Key Takeaways
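- Linear regression models a linear relationship between predictors and a target
- Check assumptions before trusting results
- Regularization reduces overfitting by penalizing large coefficients
- Lasso can perform feature selection by shrinking some coefficients to exactly zero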