Encoding Categorical Data

Topic: Preprocessing

Categorical Encoding

Most machine learning algorithms require numerical features, so categorical columns have to be encoded as numbers before training.

One-Hot Encoding

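pd.get_dummies(df['col']) creates one binary (0/1) column per category. OneHotEncoder from scikit-learn does the same, and because it is fitted on the training data it produces the identical set of columns for new data.

To avoid multicollinearity, which mainly matters for linear models, drop the first category with drop='first' (drop_first=True in pd.get_dummies). To keep categories that were not seen during fitting from raising an error, set handle_unknown='ignore'; such rows are encoded as all zeros.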
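The options above can be sketched as follows. This is a minimal illustration, not a recommended pipeline: the "color" column and its values are made-up toy data, and the sparse result is converted with .toarray() only to make it easy to inspect.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data; the "color" column and its values are invented for illustration.
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
test = pd.DataFrame({"color": ["blue", "purple"]})  # "purple" is not in train

# pandas: one binary column per category present in this particular frame.
dummies = pd.get_dummies(train["color"], prefix="color")
dummy_columns = list(dummies.columns)

# scikit-learn: learn the categories from the training data, reuse them later.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["color"]])
learned_categories = list(encoder.categories_[0])

# The unseen "purple" row becomes all zeros instead of raising an error.
test_encoded = encoder.transform(test[["color"]]).toarray()

# drop="first" removes the column of the first category in sorted order.
dropper = OneHotEncoder(drop="first")
train_dropped = dropper.fit_transform(train[["color"]]).toarray()
```

Here three categories yield three columns by default and two columns with drop="first". Fitting the encoder on the training frame and only transforming the test frame is what keeps the column layout consistent between the two.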

Label Encoding

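LabelEncoder encodes the categories of a single 1-D array as integers and is intended for the target. OrdinalEncoder does the same for feature columns and handles several columns at once.

Integer codes imply an order that may not exist in the data. Tree-based methods tolerate this because they only split on thresholds, whereas linear and distance-based models read the codes as magnitudes.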
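A short sketch of both encoders, assuming toy data (the "size" and "color" columns and the labels in y are invented for illustration). It also shows passing an explicit category order, which is one way to make the integer codes meaningful when the categories really are ordered.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy features and target; all names and values are illustrative.
X = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
})
y = ["cat", "dog", "cat", "bird"]

# OrdinalEncoder handles several feature columns at once.
# Without explicit categories, each column is coded in sorted order.
feature_encoder = OrdinalEncoder()
X_codes = feature_encoder.fit_transform(X)

# Passing the categories explicitly preserves a meaningful order for "size".
size_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = size_encoder.fit_transform(X[["size"]])

# LabelEncoder works on one 1-D array, typically the target.
label_encoder = LabelEncoder()
y_codes = label_encoder.fit_transform(y)
```

With the default sorted order, "large" gets code 0 and "small" gets code 2, which is why the explicit categories list is worth the extra line for ordered data. label_encoder.inverse_transform maps the integer codes back to the original labels.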

Target Encoding

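Target encoding replaces each category with the mean of the target over the rows in that category. The means must be computed from training data only, ideally out-of-fold, or the target leaks into the feature.

Smoothing blends the category mean with the global mean: (n * category_mean + m * global_mean) / (n + m), where n is the number of rows in the category and m is the smoothing strength. The fewer rows a category has, the closer its encoding stays to the global mean.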
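The smoothing formula can be sketched with pandas as below. The function name smoothed_target_means, the toy "city" and "bought" columns, and the choice m=2.0 are all assumptions made for illustration. The sketch only covers fitting on training data and applying to new data; out-of-fold encoding of the training rows themselves is not shown.

```python
import pandas as pd

def smoothed_target_means(categories, target, m=2.0):
    """Return (smoothed mean per category, global mean) from training data."""
    frame = pd.DataFrame({"category": list(categories), "target": list(target)})
    global_mean = frame["target"].mean()
    # n (count) and category_mean (mean) for every category.
    stats = frame.groupby("category")["target"].agg(["count", "mean"])
    # (n * category_mean + m * global_mean) / (n + m)
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return smoothed, global_mean

# Toy data; column names and values are invented for illustration.
train = pd.DataFrame({
    "city": ["A", "A", "A", "B", "C", "C"],
    "bought": [1, 1, 0, 1, 0, 0],
})
test = pd.DataFrame({"city": ["A", "B", "D"]})  # "D" is not in train

# Statistics come from the training data only, which avoids leaking test targets.
encoding, fallback = smoothed_target_means(train["city"], train["bought"], m=2.0)

# Categories unseen in training fall back to the global mean.
test_encoded = test["city"].map(encoding).fillna(fallback)
```

In this toy data the global mean is 0.5. City "B" has a raw mean of 1.0 from a single row, and with m=2.0 its encoding is pulled down to about 0.667, while city "A" with three rows moves only from about 0.667 to 0.6.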

Key Takeaways

  1. One-hot creates binary columns for categories
  2. Label encoding converts to integers
  3. Target encoding requires careful implementation
