Categorical Encoding
Most machine learning algorithms require numerical input, so categorical features must be encoded as numbers before training.
One-Hot Encoding
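pd.get_dummies(df['col']) creates one binary column per category. OneHotEncoder from sklearn does the same, and it can be fitted on training data and then reused on new data.
For linear models, drop the first category with drop='first' to avoid multicollinearity, since the full set of binary columns always sums to 1. Set handle_unknown='ignore' so that categories unseen during fitting are encoded as all zeros instead of raising an error.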
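As a rough illustration of the options above, the following sketch encodes a toy column both ways. The DataFrame contents and the column name 'color' are made up for the example; only pd.get_dummies and OneHotEncoder come from the notes.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the column name 'color' is an assumption.
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
test = pd.DataFrame({"color": ["green", "purple"]})  # 'purple' is unseen in training

# pandas: one binary column per category present in this frame.
dummies = pd.get_dummies(train["color"], prefix="color")

# scikit-learn: learn the categories on the training data, then reuse them.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["color"]])
train_encoded = encoder.transform(train[["color"]]).toarray()
test_encoded = encoder.transform(test[["color"]]).toarray()  # 'purple' -> all zeros

# drop='first' removes the first (alphabetically sorted) category, here 'blue'.
reduced_encoder = OneHotEncoder(drop="first")
train_reduced = reduced_encoder.fit_transform(train[["color"]]).toarray()
```

The fitted encoder keeps the column layout fixed, which is the main reason to prefer it over pd.get_dummies when the same transformation has to be applied to later data.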
Label Encoding
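LabelEncoder encodes categories as integers from 0 to n_classes - 1; scikit-learn intends it for the target column. OrdinalEncoder does the same for feature columns and can handle several columns at once.
Integer codes suit tree-based methods, which can split on them directly. Linear and distance-based models would treat the arbitrary integer order as meaningful, so prefer one-hot encoding there unless the categories are genuinely ordered.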
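A small sketch of the two encoders side by side may help. The feature names, category orders, and labels below are invented for illustration; passing an explicit category order to OrdinalEncoder is optional and shown here only because 'size' has a natural ordering.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical features and target, made up for this example.
X = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
})
y = ["cat", "dog", "cat", "bird"]

# OrdinalEncoder handles several feature columns at once.
# One list of categories per column, in the order the integers should follow.
ordinal = OrdinalEncoder(
    categories=[["small", "medium", "large"], ["blue", "green", "red"]]
)
X_encoded = ordinal.fit_transform(X)

# LabelEncoder works on a single 1-D array and is meant for the target.
label = LabelEncoder()
y_encoded = label.fit_transform(y)  # classes are sorted: bird, cat, dog
```

Without the categories argument, OrdinalEncoder assigns integers in sorted order, which is rarely the intended order for values like small, medium, and large.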
Target Encoding
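Replaces each category with the mean of the target for that category. Because the encoding is built from the target, it must be computed on training data only, ideally out-of-fold, to prevent leakage.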
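One common way to limit leakage is out-of-fold encoding, where each row is encoded using category means computed from the other folds only. The sketch below is one possible implementation under that assumption, not a reference one; the function name out_of_fold_target_encode and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_target_encode(df, column, target, n_splits=3, seed=0):
    """Encode `column` with target means computed on the other folds only."""
    encoded = pd.Series(index=df.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in folds.split(df):
        fit_part = df.iloc[fit_idx]
        # Category means from rows outside the fold being encoded.
        means = fit_part.groupby(column)[target].mean()
        values = df.iloc[apply_idx][column].map(means)
        # Categories missing from the fitting rows fall back to their overall mean.
        encoded.iloc[apply_idx] = values.fillna(fit_part[target].mean()).to_numpy()
    return encoded

# Hypothetical toy data.
df = pd.DataFrame({"city": list("AABBAB"), "y": [1, 0, 1, 1, 0, 0]})
df["city_encoded"] = out_of_fold_target_encode(df, "city", "y")
```

With this scheme a row's own target value never contributes to its own encoded feature, which is the property that plain per-category means lack.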
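Smoothing blends the category mean with the global mean: (n * category_mean + m * global_mean) / (n + m), where n is the number of rows in the category and m sets the smoothing strength. Rare categories are pulled toward the global mean, while frequent ones stay close to their own mean.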
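The smoothing formula can be sketched in a few lines of pandas. The helper name smoothed_target_encoding, the toy data, and the choice m=2.0 are assumptions made for the example; in practice the mapping would be fitted on training rows only.

```python
import pandas as pd

def smoothed_target_encoding(categories, target, m=10.0):
    """Return (category -> smoothed mean, global mean) for two aligned Series."""
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["count", "mean"])
    # (n * category_mean + m * global_mean) / (n + m)
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return smoothed, global_mean

# Hypothetical toy training data: global mean of y is 0.5.
train = pd.DataFrame({"city": ["A", "A", "A", "B"], "y": [1, 1, 0, 0]})
mapping, global_mean = smoothed_target_encoding(train["city"], train["y"], m=2.0)
# A: (3 * 2/3 + 2 * 0.5) / (3 + 2) = 0.6
# B: (1 * 0   + 2 * 0.5) / (1 + 2) = 1/3

# Apply to new data; unseen categories fall back to the global mean.
test = pd.DataFrame({"city": ["A", "C"]})
test_encoded = test["city"].map(mapping).fillna(global_mean)
```

Category B has a raw mean of 0 from a single row, and smoothing moves it to 1/3, which shows how m keeps sparse categories from producing extreme values.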
Key Takeaways
- One-hot creates binary columns for categories
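- Label and ordinal encoding convert categories to integers, which suits tree-based models
- Target encoding needs training-only (ideally out-of-fold) computation and smoothing to prevent leakage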