Unsupervised Learning Overview

Introduction

Unsupervised learning finds patterns in data without labels. It's used for clustering, dimensionality reduction, and anomaly detection.

Types of Unsupervised Learning

  1. Clustering - Group similar data points
  2. Dimensionality Reduction - Reduce features while preserving structure
  3. Association - Find rules describing items that frequently occur together (e.g. market-basket analysis)
  4. Anomaly Detection - Identify unusual patterns
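
Association is the one type in this list without an example further down, so here is a minimal pure-Python sketch of the three standard rule metrics (support, confidence, lift). The transactions and the helper names `support`, `confidence`, and `lift` are made up for illustration; a real project would more likely use a dedicated library.

```python
# Toy market-basket data; items and transactions are invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's support; 1.0 means independence."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


# Metrics for the rule {bread} -> {milk}
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
rule_lift = lift({"bread"}, {"milk"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}, lift={rule_lift:.2f}")
```

A lift below 1 suggests that, in this toy data, buying bread makes milk slightly less likely than its baseline rate.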

Clustering Algorithms

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering, SpectralClustering
from sklearn.mixture import GaussianMixture

Dimensionality Reduction

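# sklearn.decomposition has no class named LDA. The class it provides is
# LatentDirichletAllocation (topic modeling). Linear Discriminant Analysis is
# supervised and lives in sklearn.discriminant_analysis.
from sklearn.decomposition import PCA, LatentDirichletAllocation
from sklearn.manifold import TSNE

# UMAP is not part of scikit-learn; it ships in the separate umap-learn package
from umap import UMAP  # pip install umap-learn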

K-Means Clustering

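from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate synthetic data: 300 points around 4 centers
X, y = make_blobs(n_samples=300, centers=4, random_state=42)

# Elbow method: record inertia (within-cluster sum of squares) for each k
ks = range(1, 10)
inertias = []
for k in ks:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Plot inertia against k; the "elbow" where the curve flattens suggests k
plt.plot(ks, inertias, marker='o')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()

# Train with the chosen k (4 here, matching the generated centers)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Visualize the clusters and mark their centers
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=200)
plt.show()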
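
The elbow is read off a plot by eye, which can be ambiguous. As a complementary check, the sketch below scores each k with the silhouette coefficient (`sklearn.metrics.silhouette_score`) and picks the highest. It regenerates the same blob data so it runs on its own; the names `scores` and `best_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as the K-Means example
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Silhouette needs at least 2 clusters, so start at k=2
scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k}")
```

Unlike inertia, which always decreases as k grows, the silhouette score can peak and then fall, so it gives a direct maximum to look for.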

DBSCAN

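from sklearn.cluster import DBSCAN

# DBSCAN infers the number of clusters from point density; no k is required.
# eps is a distance in the units of X, so a good value depends on feature scale.
# (X is the make_blobs data from the K-Means example)
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

# Points that belong to no cluster are labeled -1 (noise/outliers)
n_noise = (labels == -1).sum()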
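
To illustrate how the labels are typically inspected, here is a small self-contained sketch on the two-moons dataset, a shape that centroid-based methods tend to split poorly. The `eps=0.3` value is an assumption chosen for standardized data, not a recommended default, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaved half-circles; standardize so eps is in comparable units
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)

# Count clusters, excluding the noise label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

If nearly every point comes back as noise, eps is probably too small for the data's scale; if everything merges into one cluster, it is probably too large.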

Principal Component Analysis (PCA)

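from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first: PCA is sensitive to feature scale
# (X is the make_blobs data from the K-Means example)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA. X has only 2 features, so 2 components retain all of the
# variance; choose n_components below the feature count to actually reduce.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.3f}")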
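
On higher-dimensional data the number of components is usually chosen from the explained variance. As a sketch, scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and keeps just enough components to reach that fraction. The 0.95 threshold and the use of the digits dataset here are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 8x8 digit images flattened to 64 features
digits = load_digits()
X_digits = StandardScaler().fit_transform(digits.data)

# Keep the smallest number of components explaining at least 95% of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_digits)

print(f"{X_digits.shape[1]} features -> {pca.n_components_} components")
print(f"variance retained: {pca.explained_variance_ratio_.sum():.3f}")
```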

t-SNE for Visualization

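from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# Embed the 64-dimensional digit images into 2D
digits = load_digits()
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(digits.data)

# Color each point by its true digit to see how well the classes separate
plt.scatter(X_tsne[:, 0], X_tsne[:, 1],
            c=digits.target, cmap='tab10')
plt.show()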
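
A common pattern is to compress the data with PCA before running t-SNE, which cuts noise and runtime. The sketch below does that on a 500-sample subset (an arbitrary size chosen to keep it fast) and also checks that scikit-learn's `TSNE` exposes no `transform` method, which is why the embedding cannot be applied to new points.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()
X_sub = digits.data[:500]  # subset only to keep the example quick

# Step 1: PCA down to 30 dimensions (an illustrative choice)
X_pca30 = PCA(n_components=30, random_state=42).fit_transform(X_sub)

# Step 2: t-SNE on the compressed data; perplexity must be below n_samples
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_embedded = tsne.fit_transform(X_pca30)

print(X_embedded.shape)
# TSNE only offers fit / fit_transform, so unseen data cannot be projected
print(hasattr(tsne, "transform"))
```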

Gaussian Mixture Models

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=4, random_state=42)
labels = gmm.fit_predict(X)

# Soft assignments: probs has shape (n_samples, n_components) and each row
# holds the probability of that point belonging to each component
probs = gmm.predict_proba(X)
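
The example above fixes `n_components=4` because the data was generated that way. When the number is unknown, one common approach is to fit several models and compare them with the Bayesian information criterion via `GaussianMixture.bic` (lower is better). A minimal sketch, with the candidate range 1 to 7 chosen arbitrarily:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Same synthetic data as the K-Means example
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit one model per candidate component count and record its BIC
bics = {}
for n in range(1, 8):
    gmm = GaussianMixture(n_components=n, random_state=42).fit(X)
    bics[n] = gmm.bic(X)

# Lower BIC balances goodness of fit against model complexity
best_n = min(bics, key=bics.get)
best_gmm = GaussianMixture(n_components=best_n, random_state=42).fit(X)

# Each row of probs is a probability distribution over the components
probs = best_gmm.predict_proba(X)
print(f"best n_components by BIC: {best_n}")
```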

Key Takeaways

  1. Unsupervised learning finds patterns without labels
  2. Clustering groups similar data points
  3. PCA reduces dimensions while preserving as much variance as possible
  4. t-SNE is great for visualization but not for prediction, since it cannot embed new, unseen points
