Feature engineering is the process of taking raw data and transforming it into inputs that help a machine learning model learn effectively. The model doesn’t see the world the way you do. It sees numbers. Feature engineering is the work of translating your data into a numerical form that carries the right information for the problem you’re trying to solve.
Why It’s Important
A model can only learn from what you give it. If the inputs are poorly constructed, the model will struggle regardless of how sophisticated it is. Good feature engineering can make a simple model perform extremely well. Bad feature engineering can make a powerful model perform poorly.
This is where domain knowledge pays off. Understanding what actually drives the outcome you’re predicting helps you decide which aspects of the data are worth capturing, which combinations might be meaningful, and which variables are likely to be noise.
What a Feature Is
A feature is any input variable you pass to a model. If you’re predicting employee attrition, features might include tenure, department, number of direct reports, recent performance scores, and how often someone has changed roles internally. Each of these is a feature. The set of all features you use is called the feature set.
Raw data often contains the seeds of good features but not the features themselves. A timestamp isn’t particularly useful on its own. The day of the week, the hour, whether it falls on a holiday, and the number of days since the last event are all features you could extract from that timestamp. That extraction process is feature engineering.
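As a minimal sketch of that extraction, the snippet below derives a few such features from a Python datetime. The function name and the particular features chosen are illustrative, not a fixed recipe:

```python
from datetime import datetime

def timestamp_features(ts: datetime, last_event: datetime) -> dict:
    """Derive simple features from a raw timestamp (illustrative sketch)."""
    return {
        "day_of_week": ts.weekday(),                      # 0 = Monday, 6 = Sunday
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,
        "days_since_last_event": (ts - last_event).days,  # recency feature
    }

feats = timestamp_features(datetime(2024, 3, 16, 14, 30),
                           datetime(2024, 3, 1))
```

A holiday flag would follow the same pattern, given a lookup table of holiday dates.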
Common Feature Engineering Techniques
The right techniques depend on the data type and the problem, but some come up in almost every project.
For numerical data:
- Scaling and normalization: Bringing values onto a comparable range so that features measured in thousands don’t outweigh features measured in single digits. Standard approaches include min-max scaling and standardization.
- Binning: Grouping continuous values into discrete buckets. Turning exact age into age ranges, for example, can help a model generalize better when the precise value matters less than the bracket it falls into.
- Log transformation: Compressing skewed distributions where a small number of very large values would otherwise distort the model’s learning.
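These three numerical transforms can be sketched in plain Python. The helper names and the toy income column, with its one extreme value, are invented for illustration:

```python
import math
import statistics

def min_max_scale(values):
    """Rescale to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Zero mean, unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def log_transform(values):
    """log1p compresses heavy right tails and is safe at zero."""
    return [math.log1p(v) for v in values]

incomes = [30_000, 45_000, 60_000, 1_200_000]  # one extreme value skews the column
scaled = min_max_scale(incomes)
zscores = standardize(incomes)
logged = log_transform(incomes)
```

After the log transform the largest value is only modestly bigger than the smallest, instead of forty times bigger, which is the compression the bullet describes.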
For categorical data:
- One-hot encoding: Converting a category like “color” with values red, blue, green into three separate binary columns. Simple and widely used, but expands the feature set quickly when cardinality is high.
- Target encoding: Replacing a category with the average target value for that category across the training set. Useful for high-cardinality columns but requires care to avoid data leakage.
- Ordinal encoding: Assigning numbers to categories that have a natural order, like small, medium, large becoming 1, 2, 3.
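A minimal sketch of the three encodings, using invented toy columns. Note that the target statistics here are meant to be computed on training rows only, per the leakage caveat above:

```python
def one_hot(values, categories):
    """One binary column per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def target_encode(values, targets):
    """Replace each category with its mean target (training data only)."""
    sums, counts = {}, {}
    for v, t in zip(values, targets):
        sums[v] = sums.get(v, 0) + t
        counts[v] = counts.get(v, 0) + 1
    means = {v: sums[v] / counts[v] for v in sums}
    return [means[v] for v in values], means

colors = ["red", "blue", "green", "blue"]
encoded = one_hot(colors, ["red", "blue", "green"])

cities = ["NY", "NY", "LA", "LA"]
churned = [1, 0, 1, 1]
city_enc, city_means = target_encode(cities, churned)

# Ordinal encoding: categories with a natural order get increasing integers.
size_order = {"small": 1, "medium": 2, "large": 3}
ordinal = [size_order[s] for s in ["small", "large", "medium"]]
```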
For time-based data:
- Extracting components: Pulling out year, month, day, hour, day of week, or quarter from a timestamp.
- Lag features: Using past values of a variable as features. This is common in forecasting where recent history predicts near-term behavior.
- Rolling aggregates: Computing a mean, sum, or standard deviation over a sliding time window to capture trends.
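Lag and rolling features can be sketched over a toy sales series like this. The lag and window choices are arbitrary, and each value uses only past and current points, so no future data leaks in:

```python
def lag_features(series, lags=(1, 2)):
    """Past values as features; None where history is unavailable."""
    return [
        [series[i - lag] if i - lag >= 0 else None for lag in lags]
        for i in range(len(series))
    ]

def rolling_mean(series, window=3):
    """Mean over the trailing window, using only past and current points."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

sales = [10, 12, 14, 20, 18]
lags = lag_features(sales)     # e.g. yesterday's and the day before's sales
trend = rolling_mean(sales)    # trailing 3-period average
```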
Creating New Features from Existing Ones
Some of the most useful features don’t exist in the raw data at all. They come from combining or transforming what’s already there.
Interaction features multiply or combine two variables to capture a relationship the model might not discover on its own. In a model predicting loan default, income divided by total debt load might be more predictive than either variable separately. In an e-commerce model, the ratio of a customer’s average order value to the site-wide average might capture purchasing behavior better than the raw number.
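The loan-default ratio mentioned above might be constructed like this; the field names and values are hypothetical:

```python
applicants = [
    {"income": 80_000, "debt": 20_000},
    {"income": 40_000, "debt": 35_000},
]

# Ratio feature: debt burden relative to income, guarding against zero income.
for a in applicants:
    a["debt_to_income"] = a["debt"] / a["income"] if a["income"] else None
```

The second applicant earns half as much but carries nearly double the relative burden, which the ratio makes explicit in a single number.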
This is where domain knowledge earns its place. Someone who understands the business or the problem intuitively knows which combinations are likely to be meaningful. A model searching blindly across all possible combinations will take much longer to find them, if it finds them at all.
Handling Missing Data
Real datasets have gaps. How you handle them is itself a feature engineering decision, and it affects model quality more than you might expect.
Simple approaches fill missing values with the mean, median, or mode of the column. This works reasonably well when data is missing at random and the proportion is small. When data is missing for a reason, though, the absence itself is informative. A customer who never filled in their phone number behaves differently from one who did. In cases like that, adding a binary flag that marks whether the value was missing, alongside whatever fill value you use, gives the model a chance to learn from the pattern.
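The fill-plus-flag approach might look like this in outline; median imputation is just one reasonable choice of fill value:

```python
import statistics

def impute_with_flag(values):
    """Median-fill plus a binary was-missing indicator."""
    present = [v for v in values if v is not None]
    fill = statistics.median(present)
    filled = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]   # absence as a signal
    return filled, flags

ages = [34, None, 29, 41, None]
filled, missing_flag = impute_with_flag(ages)
```

The model receives both columns, so it can treat "filled-in 34" differently from a genuine 34 if the pattern of missingness carries information.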
Feature Selection
More features aren’t always better. Irrelevant or redundant features add noise, slow down training, and can hurt generalization. Feature selection is the process of deciding which features to keep.
Some approaches are simple. For example, removing features with very low variance, or dropping columns that are too highly correlated with each other. Others are more involved, using statistical tests to measure how much each feature relates to the target, or training a model and examining which features it actually uses. Regularization techniques like Lasso can also drive less useful feature weights toward zero during training, effectively performing selection automatically.
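The two simple approaches can be sketched directly; the variance threshold and the toy columns are illustrative:

```python
import statistics

def low_variance_features(columns, threshold=1e-8):
    """Names of columns whose variance falls below the threshold."""
    return [name for name, vals in columns.items()
            if statistics.pvariance(vals) < threshold]

def correlation(xs, ys):
    """Pearson correlation (population formulas)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

columns = {
    "tenure":   [1, 3, 5, 7],
    "constant": [2, 2, 2, 2],     # no variance, no information
    "tenure2":  [2, 6, 10, 14],   # redundant rescaling of tenure
}
drop = low_variance_features(columns)
r = correlation(columns["tenure"], columns["tenure2"])  # ~1.0, so drop one
```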
The Risk of Data Leakage
Data leakage occurs when information about the target variable sneaks into your features in a way that won’t be available in production. A model trained on leaked features will look great in evaluation and fail in deployment.
It happens more easily than you’d think. Using a customer’s final lifetime value as a feature when predicting whether they’ll churn. Computing a rolling average that accidentally includes future data points. Encoding a category using target statistics calculated on the full dataset instead of just the training fold. Each of these gives the model information it wouldn’t have in a real prediction scenario.
The safest habit is to always ask: would this feature actually be available at the moment the model needs to make a prediction? If the answer is no, it shouldn’t be in the feature set.
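One common way to keep target encoding leakage-free is leave-one-out encoding, where each row's own target is excluded from the statistic. This is a sketch of that idea rather than the only safe scheme; per-fold encoding is another:

```python
def leave_one_out_encode(categories, targets):
    """Leakage-aware target encoding: each row's own target is excluded.
    Categories seen only once fall back to the global mean."""
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0) + t
        counts[c] = counts.get(c, 0) + 1
    global_mean = sum(targets) / len(targets)
    out = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            out.append((sums[c] - t) / (counts[c] - 1))  # exclude own target
        else:
            out.append(global_mean)
    return out

cats = ["a", "a", "b"]
ys = [1, 0, 1]
enc = leave_one_out_encode(cats, ys)
```

Contrast this with the leaky version: encoding each row with the category mean computed over the full dataset, own row included, hands the model a piece of its own answer.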
Feature Engineering and Modern Deep Learning
Deep learning has reduced the need for manual feature engineering in some domains. Image models learn their own visual features from raw pixels. Language models learn representations from raw text. In these areas, feeding well-prepared raw data to a powerful architecture often outperforms hand-crafted features.
But in tabular data problems, which cover a huge portion of real business applications, feature engineering still matters a great deal. Gradient boosting models, which dominate Kaggle competitions and many production use cases, benefit substantially from thoughtful feature construction. And even in deep learning contexts, decisions about normalization, encoding, and how to represent time or categorical variables remain important.
The tools have changed. The underlying need to give models good information to learn from hasn’t.