Machine Learning Introduction

What is Machine Learning?

Machine Learning (ML) is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing algorithms that learn patterns from data and use those patterns to make predictions or decisions.

Types of Machine Learning

1. Supervised Learning

Learning from labeled data where the correct output is known:

Classification: Predicting categorical labels
Regression: Predicting continuous values

$y = f(X) + \epsilon$

Where:

y = target variable
X = features
f = learned function
ε = error term
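To make the relationship $y = f(X) + \epsilon$ concrete, here is a minimal sketch that generates noisy data from a known line and then recovers an estimate of f by least squares. The ground-truth function (3x + 2), the noise level, and the helper name `fit_line` are assumptions chosen for illustration, not part of any library.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [i / 10 for i in range(100)]
# Hypothetical ground truth f(x) = 3x + 2; the Gaussian noise plays the role of epsilon
ys = [3.0 * x + 2.0 + rng.gauss(0.0, 0.5) for x in xs]

slope, intercept = fit_line(xs, ys)
print(f"learned f(x) = {slope:.2f} * x + {intercept:.2f}")
```

The learned slope and intercept should land close to, but not exactly on, 3 and 2. The gap that remains is the effect of the error term.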

Common Algorithms:

Linear Regression
Logistic Regression
Decision Trees
Random Forest
Support Vector Machines
Neural Networks

2. Unsupervised Learning

Finding patterns in unlabeled data:

Clustering: Grouping similar data points
Dimensionality Reduction: Reducing feature space

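Clustering objective (K-Means):

$\min_{C} \sum_{i=1}^{k} \sum_{x \in C_i} \text{distance}(x, \mu_i)^2$

Where:

k = number of clusters
C_i = set of points assigned to cluster i
μ_i = centroid (mean) of cluster i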
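The clustering objective above can be evaluated directly for any candidate grouping. The sketch below assumes squared Euclidean distance and uses two hypothetical helper functions, `centroid` and `within_cluster_ss`. It only scores a given assignment and does not perform the K-Means search itself.

```python
def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def within_cluster_ss(clusters):
    """Sum over clusters of squared Euclidean distances to the cluster centroid."""
    total = 0.0
    for points in clusters:
        mu = centroid(points)
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, mu))
    return total

# The same four points grouped two different ways
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
mixed = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

print(within_cluster_ss(tight))  # 4.0
print(within_cluster_ss(mixed))  # 100.0
```

The grouping that keeps nearby points together has the lower objective value, which is what a clustering algorithm minimizing this quantity is looking for.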

Common Algorithms:

K-Means Clustering
Hierarchical Clustering
DBSCAN
PCA (Principal Component Analysis)
t-SNE

3. Reinforcement Learning

Learning through interaction with an environment:

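Q-learning update rule:

$Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$

Where:

Q(s, a) = estimated value of taking action a in state s
s' = next state, a' = candidate next action
r = reward received
α = learning rate
γ = discount factor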

Key Components:

Agent: The learner
Environment: What the agent interacts with
State: The situation the agent currently observes
Action: Possible moves the agent can make
Reward: Feedback from the environment
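The update rule in this section can be sketched as a single function over a table of Q-values. This is a minimal illustration, not a full agent: the function name `q_update`, the dictionary-based table, the default values for alpha and gamma, and the toy states and actions are all assumptions.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Value of the best action available from the next state
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
actions = ["left", "right"]

# One observed transition: in state 0, moving right gave reward 1.0 and led to state 1
new_value = q_update(q, state=0, action="right", reward=1.0, next_state=1, actions=actions)
print(new_value)  # 0.1
```

With an all-zero table, the first update moves Q(0, right) a fraction alpha of the way toward the reward, giving 0.1. Later updates also fold in the discounted value of the next state.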

The ML Workflow

Data Collection → Data Preprocessing → Feature Engineering →
Model Selection → Training → Evaluation → Hyperparameter Tuning →
Deployment → Monitoring

Model Evaluation Metrics

Classification Metrics

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

$Precision = \frac{TP}{TP + FP}$

$Recall = \frac{TP}{TP + FN}$

$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$

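$AUC\text{-}ROC = \int_0^1 TPR(FPR)\, dFPR$

Where:

TP = true positives, TN = true negatives
FP = false positives, FN = false negatives
TPR = true positive rate (same as Recall)
FPR = false positive rate, FP / (FP + TN)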
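As a worked example of the first four formulas, the sketch below counts TP, TN, FP, and FN for a small set of binary labels and derives the metrics from them. The function name `classification_metrics` and the sample labels are made up for illustration. Returning 0.0 when a denominator is zero is a convention assumed here, not something the formulas define.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: report 0.0 when a denominator would be zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here TP = 3, TN = 5, FP = 1, and FN = 1, so accuracy is 0.8 while precision, recall, and F1 are all 0.75.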

Regression Metrics

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

$RMSE = \sqrt{MSE}$

$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
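The regression formulas can be checked by hand on a few values. In this sketch, y_i are the true targets, ŷ_i the predictions, SS_res the sum of squared residuals, and SS_tot the sum of squared deviations of the targets from their mean. The function name `regression_metrics` and the sample numbers are illustrative assumptions.

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R^2 computed directly from their definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r ** 2 for r in residuals)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)

    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1 - ss_res / ss_tot,
    }

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
reg_metrics = regression_metrics(y_true, y_pred)
print(reg_metrics)
```

The residuals are 1, 0, -1, and 0, so MSE and MAE are both 0.5, RMSE is about 0.707, and R² is 1 - 2/20 = 0.9.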

Bias-Variance Tradeoff

$Total\;Error = Bias^2 + Variance + Irreducible\;Error$

High Bias (Underfitting): Model is too simple, misses patterns
High Variance (Overfitting): Model learns noise, doesn't generalize
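A small simulation can illustrate the decomposition. The sketch below is a simplification under stated assumptions: it estimates a single known mean rather than fitting a full model, and it compares each estimate against the noiseless true value, so the irreducible error term does not appear. The shrinkage factor, sample size, and function names are hypothetical choices for illustration.

```python
import random
import statistics

TRUE_MEAN = 5.0

def bias_variance(estimates, true_value):
    """Split the mean squared error of a set of estimates into bias^2 and variance."""
    mean_est = statistics.fmean(estimates)
    bias_sq = (mean_est - true_value) ** 2
    variance = statistics.pvariance(estimates, mu=mean_est)
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return bias_sq, variance, mse

def simulate(shrink, rng, n_samples=10, n_trials=2000):
    """Repeatedly estimate TRUE_MEAN as shrink * (sample mean) on fresh random samples."""
    estimates = []
    for _ in range(n_trials):
        sample = [rng.gauss(TRUE_MEAN, 2.0) for _ in range(n_samples)]
        estimates.append(shrink * statistics.fmean(sample))
    return bias_variance(estimates, TRUE_MEAN)

rng = random.Random(0)
rigid = simulate(shrink=0.5, rng=rng)     # heavily shrunk estimate: more bias, less variance
flexible = simulate(shrink=1.0, rng=rng)  # plain sample mean: less bias, more variance

for name, (bias_sq, variance, mse) in [("rigid", rigid), ("flexible", flexible)]:
    print(f"{name}: bias^2={bias_sq:.3f} variance={variance:.3f} mse={mse:.3f}")
```

In both cases bias² plus variance adds up to the measured mean squared error. Shrinking the estimate trades lower variance for higher bias, which mirrors the underfitting side of the tradeoff.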

Python Implementation

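from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification example (synthetic data stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)  # predict on scaled features, matching training
accuracy = accuracy_score(y_test, predictions)

# Regression example (separate synthetic data with a continuous target)
X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

reg_model = LinearRegression()
reg_model.fit(Xr_train, yr_train)
y_pred = reg_model.predict(Xr_test)
mse = mean_squared_error(yr_test, y_pred)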

Overfitting and Underfitting

Overfitting:

Model is too complex for the amount of data
High training accuracy, low test accuracy
Solutions: Regularization, more training data, feature selection

Underfitting:

Model is too simple to capture the pattern
Low accuracy on both training and test data
Solutions: More features, a more complex model
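The overfitting signature (high training accuracy, lower test accuracy) can be reproduced with a model that simply memorizes its training data. The sketch below uses a 1-nearest-neighbour classifier on synthetic one-dimensional data with deliberately noisy labels. The data generator, noise rate, and function names are assumptions made for illustration.

```python
import random

def one_nn_predict(train, x):
    """Return the label of the training point closest to x (1-nearest neighbour)."""
    _, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label

def accuracy(train, data):
    """Fraction of points in data that the 1-NN model built on train labels correctly."""
    hits = sum(one_nn_predict(train, x) == label for x, label in data)
    return hits / len(data)

def make_data(rng, n, noise=0.2):
    """Points in [0, 1) labelled 1 when x > 0.5, with a fraction of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label  # label noise the model should not learn
        data.append((x, label))
    return data

rng = random.Random(42)
train = make_data(rng, 200)
test = make_data(rng, 200)

train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
print(f"train accuracy={train_acc:.2f} test accuracy={test_acc:.2f}")
```

Because every training point is its own nearest neighbour, training accuracy is perfect. The memorized noisy labels do not carry over to new points, so test accuracy comes out noticeably lower.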

Cross-Validation

$CV\;Score = \frac{1}{k}\sum_{i=1}^{k} Score_i$

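Common values of k: 5 and 10

from sklearn.model_selection import cross_val_score
# Assumes model, X, and y as used in the classification example above
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
cv_score = scores.mean()  # average of the k fold scores, matching the formula above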
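To show what happens inside k-fold cross-validation, here is a dependency-free sketch that splits the indices into k folds, scores a trivial majority-class baseline on each held-out fold, and averages the results. The helper names and the toy labels are assumptions. The folds are contiguous and unshuffled to keep the example easy to follow, whereas real use typically shuffles or stratifies the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_val_accuracy(labels, k):
    """Average accuracy of a majority-class baseline over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        train_labels = [labels[i] for i in train_idx]
        # "Training" here just means finding the most common label
        majority = max(set(train_labels), key=train_labels.count)
        correct = sum(labels[i] == majority for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k, scores

labels = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
cv_score, fold_scores = cross_val_accuracy(labels, k=5)
print(fold_scores)  # [0.5, 1.0, 0.5, 0.5, 1.0]
print(cv_score)
```

Each sample is used for testing exactly once, and the reported score of about 0.7 is the mean of the five per-fold accuracies.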

Key Takeaways

ML has three main types: supervised, unsupervised, and reinforcement
Choose the right algorithm based on problem type
Evaluation metrics vary by problem type
Bias-variance tradeoff is fundamental to model performance
Cross-validation gives more reliable performance estimates than a single train/test split
