Machine Learning Introduction

What is Machine Learning?

Machine learning (ML) is a subset of artificial intelligence in which systems improve at a task by learning from data rather than by following explicitly programmed rules. It focuses on algorithms that learn patterns from data and use those patterns to make predictions or decisions.

Types of Machine Learning

1. Supervised Learning

Learning from labeled data where the correct output is known:

  • Classification: Predicting categorical labels
  • Regression: Predicting continuous values

$$y = f(X) + \epsilon$$

Where:

  • y = target variable
  • X = features
  • f = learned function
  • ε = error term
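
As a minimal sketch of the model above, the following generates synthetic data from a known linear f plus Gaussian noise and then recovers f by ordinary least squares on a single feature. The true slope and intercept (2.0 and 1.0), the noise level, and the helper name `fit_line` are illustrative assumptions, not part of the original material.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


rng = random.Random(0)                       # fixed seed for repeatability
xs = [i / 10 for i in range(200)]            # feature X
# target y = f(X) + epsilon, with f(x) = 2x + 1 (assumed for illustration)
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

slope, intercept = fit_line(xs, ys)
print(f"learned f(x) = {slope:.2f} * x + {intercept:.2f}")
```

The learned coefficients should land close to the true ones but not exactly on them, because the error term ε cannot be predicted from X.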

Common Algorithms:

  • Linear Regression
  • Logistic Regression
  • Decision Trees
  • Random Forest
  • Support Vector Machines
  • Neural Networks

2. Unsupervised Learning

Finding patterns in unlabeled data:

  • Clustering: Grouping similar data points
  • Dimensionality Reduction: Reducing feature space

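Objective (K-Means):

$$\min_{C} \sum_{i=1}^{k} \sum_{x \in C_i} \operatorname{distance}(x, \mu_i)^2$$

Where:

  • k = number of clusters
  • C_i = the set of points assigned to cluster i
  • μ_i = the centroid (mean) of cluster i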
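
To make the clustering objective concrete, here is a small sketch that evaluates it for a given assignment of points to clusters, using squared Euclidean distance. The function name `within_cluster_ss` and the four sample points are hypothetical. This is only the quantity K-Means tries to minimize, not the algorithm itself.

```python
def within_cluster_ss(points, labels, k):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        if not members:
            continue  # an empty cluster contributes nothing
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in members
        )
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

good = within_cluster_ss(points, [0, 0, 1, 1], k=2)  # groups nearby points
bad = within_cluster_ss(points, [0, 1, 0, 1], k=2)   # groups distant points
print(good, bad)  # 4.0 100.0
```

The assignment that groups nearby points gives the smaller objective value, which is why minimizing it produces compact clusters.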

Common Algorithms:

  • K-Means Clustering
  • Hierarchical Clustering
  • DBSCAN
  • PCA (Principal Component Analysis)
  • t-SNE

3. Reinforcement Learning

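Learning through interaction with an environment. Q-learning, for example, updates its estimate of the value of taking action a in state s with:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:

  • s, a = current state and the action taken
  • r = reward received
  • s' = next state, a' = a candidate action in the next state
  • α = learning rate
  • γ = discount factor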

Key Components:

  • Agent: The learner
  • Environment: What the agent interacts with
  • Action: Possible moves the agent can make
  • Reward: Feedback from the environment

The ML Workflow

Data Collection → Data Preprocessing → Feature Engineering →
Model Selection → Training → Evaluation → Hyperparameter Tuning →
Deployment → Monitoring

Model Evaluation Metrics

Classification Metrics

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{AUC-ROC} = \int_0^1 TPR(FPR) \, dFPR$$
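
The first four formulas can be computed directly from the confusion-matrix counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The sketch below uses made-up counts. Returning 0.0 when a denominator is zero is a convention assumed here, not something the formulas specify. AUC-ROC is left out because it needs predicted scores rather than counts.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # assumed zero-division rule
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a binary classifier evaluated on 100 samples
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 0.85 while recall is only 0.80, which shows why a single metric rarely tells the whole story.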

Regression Metrics

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$RMSE = \sqrt{MSE}$$

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
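
A short worked example of the regression metrics, written in plain Python so each term maps onto the formulas. The four target and prediction values are invented for illustration. Here SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the targets from their mean.

```python
import math


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R^2 for paired lists of targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]   # two predictions are off by 1
r = regression_metrics(y_true, y_pred)
print(r)  # mse 0.5, mae 0.5, r2 0.9, rmse about 0.7071
```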

Bias-Variance Tradeoff

$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

  • High Bias (Underfitting): Model is too simple, misses patterns
  • High Variance (Overfitting): Model learns noise, doesn't generalize
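
The decomposition can be estimated empirically at a single query point by retraining a model on many freshly sampled training sets and looking at how its predictions spread. The sketch below compares two deliberately extreme models. Everything in it is an assumption for illustration: the true function x², the noise level, the query point, and the two toy predictors.

```python
import random


def true_f(x):
    return x ** 2


def predict_mean(train_x, train_y, x0):
    """High-bias model: ignores x0 and predicts the mean training target."""
    return sum(train_y) / len(train_y)


def predict_1nn(train_x, train_y, x0):
    """High-variance model: copies the target of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))
    return train_y[nearest]


def bias_variance(predict, x0, trials=500, n=20, sigma=0.3, seed=0):
    """Estimate (bias^2, variance) of predict at x0 over resampled training sets."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]          # fixed inputs on [0, 1]
    preds = []
    for _ in range(trials):
        ys = [true_f(x) + rng.gauss(0, sigma) for x in xs]   # new noise each trial
        preds.append(predict(xs, ys, x0))
    mean_pred = sum(preds) / trials
    bias_sq = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return bias_sq, variance


mean_bias_sq, mean_var = bias_variance(predict_mean, x0=1.0)
nn_bias_sq, nn_var = bias_variance(predict_1nn, x0=1.0)
print(f"mean predictor: bias^2={mean_bias_sq:.4f}, variance={mean_var:.4f}")
print(f"1-NN predictor: bias^2={nn_bias_sq:.4f}, variance={nn_var:.4f}")
```

The mean predictor barely changes between training sets but is consistently wrong at x0, while the 1-nearest-neighbour predictor is right on average but inherits the full noise of a single training point.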

Python Implementation

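from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification example (synthetic data stands in for your own X and y)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)  # predict on scaled test features
accuracy = accuracy_score(y_test, predictions)

# Regression example (separate synthetic data with a continuous target)
X_reg, y_reg = make_regression(
    n_samples=500, n_features=10, noise=10.0, random_state=42
)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)
reg_model = LinearRegression()
reg_model.fit(Xr_train, yr_train)
y_pred = reg_model.predict(Xr_test)
mse = mean_squared_error(yr_test, y_pred)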

Overfitting and Underfitting

Overfitting:

  • The model is too complex for the available data
  • High training accuracy, low test accuracy
  • Solutions: regularization, more data, feature selection

Underfitting:

  • The model is too simple
  • Low training and test accuracy
  • Solutions: more features, a more complex model
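
One way to see both failure modes is to compare training and test accuracy as model complexity changes. The sketch below uses a toy k-nearest-neighbours classifier on noisy one-dimensional data: k=1 memorizes the training set, and k equal to the training-set size always predicts the majority class. The data generator, the 20% label-noise rate, and the chosen values of k are all assumptions for illustration.

```python
import random
from collections import Counter


def make_data(n, rng, flip=0.2):
    """1-D points labelled by x > 0.5, with a fraction of labels flipped as noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < flip:
            label = 1 - label
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, data, k):
    hits = sum(knn_predict(train, x, k) == label for x, label in data)
    return hits / len(data)


rng = random.Random(0)
train = make_data(300, rng)
test = make_data(300, rng)

results = {}
for k in (1, 25, len(train)):   # very flexible, moderate, very rigid
    results[k] = (accuracy(train, train, k), accuracy(train, test, k))
    print(f"k={k}: train={results[k][0]:.2f}, test={results[k][1]:.2f}")
```

A large gap between training and test accuracy (as with k=1) points to overfitting, while low accuracy on both (as with k equal to the training-set size) points to underfitting.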

Cross-Validation

$$\text{CV Score} = \frac{1}{k}\sum_{i=1}^{k} \text{Score}_i$$

Common k values: 5, 10

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
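
To show what the averaged score means, here is a hand-rolled k-fold loop in plain Python. It is a simplified sketch, not a reimplementation of `cross_val_score`: it uses contiguous, unshuffled folds, and the helper names (`k_fold_indices`, `cross_val_mean`, `fit_majority`, `accuracy`) and the toy majority-class model are hypothetical.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mean(fit, score, X, y, k=5):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(
            score(model, [X[i] for i in held_out], [y[i] for i in held_out])
        )
    return sum(scores) / k, scores


def fit_majority(X_train, y_train):
    """Toy model: remember the most common training label."""
    return Counter(y_train).most_common(1)[0][0]


def accuracy(model, X_test, y_test):
    return sum(model == label for label in y_test) / len(y_test)


X = list(range(10))
y = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
mean_score, fold_scores = cross_val_mean(fit_majority, accuracy, X, y, k=5)
print(fold_scores, mean_score)  # [1.0, 0.5, 1.0, 0.5, 1.0] and a mean of 0.8
```

Every sample is used for testing exactly once, so the average depends less on one lucky or unlucky split than a single train/test evaluation does.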

Key Takeaways

  1. ML has three main types: supervised, unsupervised, and reinforcement
  2. Choose the right algorithm based on problem type
  3. Evaluation metrics vary by problem type
  4. Bias-variance tradeoff is fundamental to model performance
  5. Cross-validation gives a more reliable performance estimate than a single train/test split
