Introduction
scikit-learn's sklearn.preprocessing module provides tools for feature scaling, categorical encoding, and feature transformation.
Standard Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
data = np.array([[1, 2], [3, 4], [5, 6]])
# Z-score standardization
scaler = StandardScaler()
scaled = scaler.fit_transform(data)
print(scaled.mean(axis=0)) # approximately [0. 0.]
print(scaled.std(axis=0)) # approximately [1. 1.]
# Min-Max scaling
minmax = MinMaxScaler()
normalized = minmax.fit_transform(data)
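To make the two scalers less of a black box, the sketch below recomputes both transforms by hand with NumPy and compares them to the scikit-learn results. It assumes the same toy array as above, cast to float; the names manual_z and manual_mm are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

data = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

# Z-score by hand: subtract the column mean, divide by the population std (ddof=0)
manual_z = (data - data.mean(axis=0)) / data.std(axis=0)
sk_z = StandardScaler().fit_transform(data)

# Min-max by hand: map each column onto the range [0, 1]
col_min, col_max = data.min(axis=0), data.max(axis=0)
manual_mm = (data - col_min) / (col_max - col_min)
sk_mm = MinMaxScaler().fit_transform(data)

print(np.allclose(manual_z, sk_z))    # True
print(np.allclose(manual_mm, sk_mm))  # True
```

Both scalers work column by column, so each feature is rescaled independently of the others.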
Label Encoding
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
# Simple label encoding
labels = ['cat', 'dog', 'bird', 'cat', 'dog']
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
print(encoded) # [1 2 0 1 2] (classes are sorted: bird=0, cat=1, dog=2)
# Inverse transform
decoded = encoder.inverse_transform([0, 1, 2]) # ['bird' 'cat' 'dog']
# One-hot encoding
onehot = OneHotEncoder(sparse_output=False)
reshaped = np.array(labels).reshape(-1, 1)
onehot_encoded = onehot.fit_transform(reshaped)
Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.array([[1, 2], [3, 4]])
# Create degree 2 polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly.shape) # (2, 5): x0, x1, x0^2, x0 x1, x1^2
print(poly.get_feature_names_out())
Custom Transformers
from sklearn.preprocessing import FunctionTransformer
import numpy as np
# Log transformation
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
data = np.array([[0, 1], [2, 3]])
transformed = log_transformer.fit_transform(data)
# Square transformation (np.sqrt recovers the input only for non-negative values)
square_transformer = FunctionTransformer(lambda x: x**2, inverse_func=np.sqrt)
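A round trip through transform and inverse_transform is a simple way to see whether inverse_func really undoes func. The sketch below assumes float input; the negative sample value and the check_inverse=False setting are illustrative choices to show where squaring loses information.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p and expm1 are exact inverses for inputs greater than -1
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
data = np.array([[0, 1], [2, 3]], dtype=float)
transformed = log_transformer.fit_transform(data)
restored = log_transformer.inverse_transform(transformed)
print(np.allclose(restored, data))  # True

# Squaring discards the sign, so sqrt cannot restore negative inputs.
# check_inverse=False skips the fit-time consistency check.
square_transformer = FunctionTransformer(
    np.square, inverse_func=np.sqrt, check_inverse=False
)
signed = np.array([[-2.0, 3.0]])
squared = square_transformer.fit_transform(signed)
round_trip = square_transformer.inverse_transform(squared)
print(round_trip)  # [[2. 3.]]: the sign of -2 is lost
```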
Practice Problems
- Scale features using StandardScaler
- Encode categorical labels
- Create polynomial features
- Build a pipeline that includes a custom transformer
- Handle missing values with imputation
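As a possible starting point for the last two problems, here is a minimal sketch that chains mean imputation, a log transform, and standard scaling in a Pipeline. The toy array, the step names, and the choice of SimpleImputer(strategy="mean") are assumptions for illustration, not a prescribed solution.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy data with one missing value in each column
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill NaN with the column mean
    ("log", FunctionTransformer(np.log1p, inverse_func=np.expm1)),
    ("scale", StandardScaler()),                 # zero mean, unit variance
])

X_ready = pipeline.fit_transform(X)
print(X_ready.shape)            # (4, 2)
print(np.isnan(X_ready).any())  # False
```

Fitting the whole pipeline on training data and calling pipeline.transform on new data reuses the learned means and scaling parameters.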