## Definition

**One-hot encoding** turns a categorical variable with $k$ possible values into $k$ binary columns, exactly one of which is 1 for each row. It is the default encoding for nominal (unordered) categorical features in models that require numeric input.

## Example

Original column `Color ∈ {Red, Green, Blue}`:

| Color | → | Red | Green | Blue |
|--------|---|-----|-------|------|
| Red | | 1 | 0 | 0 |
| Blue | | 0 | 0 | 1 |
| Green | | 0 | 1 | 0 |

## Why Not Just Map to Integers?

Mapping `{Red: 1, Green: 2, Blue: 3}` introduces a fake ordering. A linear model would treat Blue as "three times more X" than Red, which is meaningless. One-hot encoding avoids this ordinal illusion: each category is its own dimension.

## Variants

### Dummy encoding (drop one)

Use $k - 1$ columns; the all-zeros row represents the dropped category. This avoids perfect multicollinearity: the original $k$ columns always sum to 1, which is an exact linear dependency when the model also has an intercept. It is the standard choice for linear regression, where collinearity matters.

### Drop-first one-hot

This is dummy encoding with the first category as the dropped one. scikit-learn's `OneHotEncoder(drop='first')` does this automatically.

### Effect coding

Similar to dummy encoding, but the reference category is encoded as -1 in every column instead of 0; useful in some statistical analyses.

## Trade-offs

**Pros:**

- No fake ordering.
- Each category is fully isolated in its own column.
- Trivial to implement.

**Cons:**

- **High cardinality blows up.** A `user_id` column with 100k unique values becomes 100k binary columns, which is sparse and memory-heavy.
- **Cold-start problem.** A category that was not seen at training time has no encoding, so an "unknown" bucket is needed.
- **Loses information.** Two semantically similar categories ("XL", "XXL") are as distant in the encoded space as two completely unrelated ones.

## Alternatives for High Cardinality

- **Target encoding.** Replace each category with the mean target value for that category, computed within folds to avoid leakage.
- **Frequency encoding.** Replace each category with its count or proportion.
- **Embeddings.** Learn a dense vector per category; this is the bridge to deep tabular models.
- **Hashing trick.** Hash each category into a fixed number of buckets. Some categories collide, but the dimensionality stays fixed.

## With Tree-Based Models

Modern implementations of tree models (random forest, gradient boosting) can handle categorical features natively: XGBoost, LightGBM, and CatBoost all support this, and CatBoost is built specifically around categorical handling. One-hot encoding still works but is rarely optimal.

## Related

- [[Feature Engineering]]
- [[Feature Scaling]]
- [[Logistic Regression]]
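
## Worked Sketch

The one-hot, dummy (drop-first), and hashing-trick encodings described in this note can be illustrated with a minimal pure-Python sketch. The helper names `fit_categories`, `one_hot`, and `hashed_one_hot` are hypothetical and written only for this example; mapping unseen categories to an all-zeros row is one possible "unknown" policy, assumed here for simplicity. In practice a library encoder such as scikit-learn's `OneHotEncoder` would usually be the better choice.

```python
import hashlib


def fit_categories(values):
    """Return the sorted list of distinct categories seen in the training data."""
    return sorted(set(values))


def one_hot(value, categories, drop_first=False):
    """Encode one value as a list of 0/1 indicators.

    Unseen values become an all-zeros row (an assumed "unknown" policy).
    With drop_first=True the first category is dropped, so it is also
    encoded as all zeros (dummy encoding, k - 1 columns).
    """
    columns = categories[1:] if drop_first else categories
    return [1 if value == c else 0 for c in columns]


def hashed_one_hot(value, n_buckets):
    """Hashing trick: map a category to one of n_buckets indicator columns.

    Uses MD5 only as a stable, deterministic hash; collisions are possible.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_buckets
    return [1 if i == bucket else 0 for i in range(n_buckets)]


colors = ["Red", "Blue", "Green"]
cats = fit_categories(colors)  # ['Blue', 'Green', 'Red']

full = [one_hot(c, cats) for c in colors]                    # k columns
dummy = [one_hot(c, cats, drop_first=True) for c in colors]  # k - 1 columns
unseen = one_hot("Purple", cats)                             # all zeros
hashed = hashed_one_hot("user_12345", n_buckets=8)           # fixed width

print(full)    # [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
print(dummy)   # [[0, 1], [0, 0], [1, 0]]
print(unseen)  # [0, 0, 0]
```

Note that with `drop_first=True` the dropped category and an unseen category both encode to all zeros, so this simple "unknown" policy cannot tell them apart; a dedicated unknown column avoids that ambiguity.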