Introduction
Unsupervised learning finds patterns in data without labels. It's used for clustering, dimensionality reduction, and anomaly detection.
Types of Unsupervised Learning
- Clustering - Group similar data points
- Dimensionality Reduction - Reduce features while preserving structure
- Association - Find rules describing items that frequently occur together
- Anomaly Detection - Identify unusual patterns
Clustering Algorithms
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering, SpectralClustering
from sklearn.mixture import GaussianMixture
Dimensionality Reduction
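from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# scikit-learn has no class named LDA; Linear Discriminant Analysis is
# supervised (it needs labels) and lives in sklearn.discriminant_analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# UMAP is not part of scikit-learn; it comes from the separate umap-learn package
from umap import UMAP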
K-Means Clustering
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Generate data
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, random_state=42)
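# Find optimal k using the elbow method
inertias = []
k_values = range(1, 10)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
# Plot inertia against k; the "elbow" where the curve flattens suggests a good k
plt.plot(k_values, inertias, marker='o')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()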
# Train with optimal k
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
# Visualize the clusters and mark their centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=200)
plt.show()
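The elbow in an inertia curve can be hard to read by eye. As a complementary check, the sketch below scores each candidate k with the silhouette coefficient from sklearn.metrics, which ranges from -1 to 1, with higher values meaning tighter and better-separated clusters. The range of k values and the explicit n_init=10 are illustrative assumptions, not settings from the example above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as in the K-Means example
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Silhouette needs at least 2 clusters, so start at k=2
scores = {}
for k in range(2, 10):
    # n_init=10 is an assumed setting to make results less init-dependent
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest average silhouette
best_k = max(scores, key=scores.get)
print(f"Best k by silhouette score: {best_k}")
```

If the silhouette peak and the elbow disagree, it is usually worth plotting the clusters for both candidates before settling on one.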
DBSCAN
from sklearn.cluster import DBSCAN
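# DBSCAN doesn't require specifying the number of clusters.
# eps is the neighborhood radius and min_samples the number of neighbors
# a point needs to count as a core point; eps depends on the scale of the data.
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)
# Points that belong to no cluster (noise/outliers) get the label -1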
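Because DBSCAN decides the number of clusters itself, it helps to inspect the result after fitting. The following is a minimal sketch that counts the clusters and the noise points; standardizing the data first and using eps=0.3 are assumptions chosen for illustration, not tuned values.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Standardize so that eps is expressed in comparable units on both axes
X_scaled = StandardScaler().fit_transform(X)

# eps=0.3 is an illustrative value for the standardized data
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Label -1 marks noise, so exclude it when counting clusters
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```

A very large noise count typically suggests eps is too small for the data's scale, while a single giant cluster suggests it is too large.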
Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Scale data first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
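# Apply PCA. X has only two features here, so keeping two components retains
# all the variance; on wider data, choose n_components below the feature count.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)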
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.3f}")
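On data with more features, the number of components is often chosen by how much variance to keep rather than fixed in advance. The sketch below passes a fraction to n_components so that PCA keeps the smallest number of components reaching that share of variance. The digits dataset and the 95% threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1797 samples with 64 pixel features each
X_digits = load_digits().data
X_digits_scaled = StandardScaler().fit_transform(X_digits)

# A float in (0, 1) asks PCA to keep enough components to explain
# at least that fraction of the variance (requires the full SVD solver)
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X_digits_scaled)

cumulative = np.cumsum(pca_95.explained_variance_ratio_)
print(f"Components kept: {pca_95.n_components_} of {X_digits.shape[1]}")
print(f"Variance explained: {cumulative[-1]:.3f}")
```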
t-SNE for Visualization
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
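# Embed the 64-dimensional digit images into 2D
digits = load_digits()
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(digits.data)
# Color each point by its true digit label
plt.scatter(X_tsne[:, 0], X_tsne[:, 1],
            c=digits.target, cmap='tab10')
plt.show()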
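The shape of a t-SNE embedding depends noticeably on the perplexity parameter, which roughly controls how many neighbors each point attends to. The sketch below computes embeddings at two perplexity values so they can be compared side by side. The 500-sample subset (used only to keep the run short) and the two perplexity values are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
# Use a subset so the example runs quickly; this is an assumed shortcut
X_sub = digits.data[:500]

embeddings = {}
for perplexity in (5, 30):
    # perplexity must be smaller than the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = tsne.fit_transform(X_sub)
    print(f"perplexity={perplexity}: embedding shape {embeddings[perplexity].shape}")
```

Each array in embeddings can be passed to plt.scatter in the same way as X_tsne above. Distances between well-separated groups in a t-SNE plot should be read with caution, since the method mainly preserves local neighborhoods.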
Gaussian Mixture Models
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4, random_state=42)
labels = gmm.fit_predict(X)
# Get probability of each point belonging to each cluster
probs = gmm.predict_proba(X)
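The number of mixture components has to be chosen, much like k in K-Means. One common approach, sketched below, fits several models and compares their Bayesian Information Criterion (BIC) via the bic method of GaussianMixture, where lower is better. The candidate range of 1 to 7 components is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit one model per candidate component count and record its BIC
bic = {}
for n in range(1, 8):
    model = GaussianMixture(n_components=n, random_state=42).fit(X)
    bic[n] = model.bic(X)

# Lower BIC indicates a better trade-off between fit and complexity
best_n = min(bic, key=bic.get)
print(f"Lowest BIC at n_components={best_n}")

# Soft assignments from the selected model: one row per point,
# one column per component, each row summing to 1
final = GaussianMixture(n_components=best_n, random_state=42).fit(X)
probs = final.predict_proba(X)
print(probs.shape)
```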
Key Takeaways
- Unsupervised learning finds patterns without labels
- Clustering groups similar data points
- PCA reduces dimensions while preserving as much variance as possible
- t-SNE is great for visualization but not for prediction: scikit-learn's TSNE has no transform method for new data