Feature Engineering - Albert Masoliver's learning site

## Definition

**Feature engineering** is the process of transforming raw data into features that better expose the predictive signal to a model. For decades it was the highest-leverage activity in ML. Deep learning has partially superseded it for unstructured data (text, images), but it remains decisive for tabular, time-series, and structured data.

## Why It Matters

A model's input is the feature vector, not the raw data. Good features:

- **Highlight predictive structure** the model can exploit.
- **Hide irrelevant noise** that would dilute the signal.
- **Encode domain knowledge** explicitly, reducing data requirements.
- **Match the model's [[Inductive Bias]].** Tree-based models like piecewise-constant features; linear models like centred, scaled features.

## Common Transformations

### Numeric

- **Scaling** ([[Feature Scaling]]): standardisation, min-max, robust scaling.
- **Log / Box-Cox transforms**: for right-skewed distributions (income, prices).
- **Binning**: convert continuous to discrete (age → age bracket).
- **Interaction features**: products or ratios of two raw features.
- **Polynomial features**: $x_1^2$, $x_1 x_2$ etc. expand the feature space.

### Categorical

- **[[One-Hot Encoding]]**: binary indicator per category.
- **Target encoding**: replace each category with its mean target value (compute the means on training folds only, to avoid target leakage).
- **Embeddings**: learn a low-dimensional vector per category; this is the bridge to deep learning for tabular data.
- **Ordinal encoding**: assign ordered numbers when the categorical has a natural order.

### Temporal

- **Cyclical encoding**: `sin(2πt/P)`, `cos(2πt/P)` with period `P` (24 for hour-of-day, about 365 for day-of-year). Better than treating hour as ordinal, because 23:00 and 00:00 end up adjacent.
- **Lagged features**: value 1, 7, 30 periods ago.
- **Rolling statistics**: moving average, moving std, moving min/max.
- **Time-since-event**: recency features.

### Text (pre-LLM)

- **Bag-of-words.**
- **TF-IDF.**
- **n-grams.**
- **Topic models** (LDA).

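
The first three text features above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`tokenize`, `ngrams`, `bag_of_words`, `tf_idf`), the whitespace tokenizer, and the toy documents are all assumptions made for the example, and the TF-IDF variant used here (term frequency as count over document length, idf as `log(N / df)`) is only one of several in common use.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(docs, n=1):
    """Count n-gram occurrences per document (n=1 gives plain bag-of-words)."""
    return [Counter(ngrams(tokenize(doc), n)) for doc in docs]


def tf_idf(docs, n=1):
    """Weight each term by tf * idf, with tf = count / doc length, idf = log(N / df)."""
    counts = bag_of_words(docs, n)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in c.items()
        })
    return weights


# Hypothetical toy corpus, for illustration only.
docs = ["the cat sat", "the dog sat", "the cat ran"]
bow = bag_of_words(docs)           # unigram counts per document
bigrams = bag_of_words(docs, n=2)  # bigram counts per document
weights = tf_idf(docs)             # one {term: weight} dict per document
```

In this toy corpus "the" appears in every document, so its idf is `log(3/3) = 0` and it drops out, while a term unique to one document ("ran") gets the largest weight. Library implementations typically use a smoothed idf, so their numbers will differ from this sketch.
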
(Modern: feed text directly to a model or use pre-trained embeddings.)

## Feature Stores

In production ML, *features* themselves become first-class artefacts:

- Stored, versioned, served at training and inference time.
- Reused across models and teams.
- Tested independently of any model.

Tools: Feast, Tecton, Databricks Feature Store.

## Deep Learning Caveat

For images, audio, and increasingly text, learned representations from neural networks outperform hand-engineered features. The "Bitter Lesson": engineering effort tends to be displaced by scale + learning.

But for *tabular* data, which covers most enterprise ML, engineered features still dominate, often combined with gradient-boosted trees.

## Related

- [[Feature Scaling]]
- [[One-Hot Encoding]]
- [[Feature Selection]]
- [[Dimensionality Reduction]]