Feature Engineering Techniques That Boost Model Accuracy
Last updated: Jun 19, 2026
Table of contents
- Quick TL;DR
- Introduction
- Technique 1: Feature Scaling and Normalisation
- Technique 2: Encoding Categorical Variables
- Technique 3: Log Transform for Skewed Distributions
- Technique 4: Extracting Features from Datetime Columns
- Technique 5: Polynomial and Interaction Features
- Technique 6: Binning Continuous Variables
- Technique 7: Intelligent Missing Value Imputation
- Feature Engineering Techniques: Quick Reference
- Common Mistakes in Feature Engineering
- Conclusion
- FAQs
- What is feature engineering in machine learning?
- Why is feature engineering important?
- What is the difference between normalisation and standardisation?
- When should I use one-hot encoding vs target encoding?
- What is a log transform and when should I apply it?
- What is data leakage in feature engineering?
- How do interaction features improve model accuracy?
Quick TL;DR
- Feature engineering transforms raw data into meaningful inputs that help machine learning models learn patterns more effectively.
- Key techniques include scaling, encoding categorical variables, log transformations, feature creation, and handling missing values.
- Strong feature engineering often improves performance more than model tuning alone.
- It is also a highly important topic in real-world ML projects, competitions, and data science interviews.
Introduction
Feature engineering transforms raw data into meaningful features that help machine learning models uncover patterns more effectively. It often has a greater impact on accuracy than choosing a specific algorithm. Techniques such as scaling, encoding, feature creation, and handling missing values can significantly improve model performance. This guide explores the most important feature engineering methods with practical Python examples.
Want to build end-to-end machine learning pipelines — from raw data cleaning and feature engineering through model training, evaluation, and deployment — with structured guidance and real industry datasets? Check out HCL GUVI’s Data science Programme built for aspiring data scientists and ML engineers who want to crack interviews at product-based companies and build models that work in production.
Technique 1: Feature Scaling and Normalisation
Distance-based algorithms such as K-Nearest Neighbours and Support Vector Machines, along with gradient-trained neural networks, are sensitive to feature scale: features with large numeric ranges carry more weight simply because their raw values are bigger. A feature like annual income (0–500,000) can dominate a feature like age (0–80) in any distance calculation before a single training step runs. Feature scaling reduces this bias by bringing all features onto a comparable range.
StandardScaler vs MinMaxScaler

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd

df = pd.DataFrame({'age': [25, 40, 35], 'income': [30000, 90000, 60000]})

# StandardScaler: mean=0, std=1; use when the distribution is roughly normal
std_scaler = StandardScaler()
df_standard = std_scaler.fit_transform(df)

# MinMaxScaler: range [0, 1]; use when you need bounded output
mm_scaler = MinMaxScaler()
df_minmax = mm_scaler.fit_transform(df)
```
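To make the dominance effect concrete, the following sketch compares Euclidean distances between three people before and after standardisation. The rows, the `euclidean` helper, and the variable names are illustrative assumptions, not data from a real project.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows: [age, income]
X = np.array([
    [25.0, 30000.0],   # person A
    [60.0, 31000.0],   # person B: very different age, similar income
    [26.0, 60000.0],   # person C: similar age, very different income
])

def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

# Raw distances: the income difference swamps the age difference
raw_ab = euclidean(X[0], X[1])
raw_ac = euclidean(X[0], X[2])

# Standardised distances: both features contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
scaled_ab = euclidean(X_scaled[0], X_scaled[1])
scaled_ac = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    A-B={raw_ab:.1f}, A-C={raw_ac:.1f}")
print(f"scaled: A-B={scaled_ab:.2f}, A-C={scaled_ac:.2f}")
```

On the raw values, A looks roughly thirty times closer to B than to C, even though B is 35 years older. After standardisation the two distances are about equal, so a KNN model would weigh the age gap and the income gap on similar terms.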
Technique 2: Encoding Categorical Variables
Machine learning algorithms operate on numbers, not strings. A column containing values like ‘Mumbai’, ‘Delhi’, and ‘Chennai’ must be converted to a numeric representation before any model can use it. Choosing the wrong encoding strategy is one of the most common sources of avoidable accuracy loss.
One-Hot Encoding for Low-Cardinality Columns

```python
import pandas as pd

df = pd.DataFrame({'city': ['Mumbai', 'Delhi', 'Mumbai', 'Chennai']})

# Creates a binary column for each unique category
df_encoded = pd.get_dummies(df, columns=['city'], drop_first=True)
# Result: city_Delhi, city_Mumbai (Chennai becomes the reference)
```

Target Encoding for High-Cardinality Columns

```python
# Use when a column has hundreds of unique values
# (e.g. postal codes, product SKUs, user IDs)

# Illustrative target values so the snippet runs on its own
df['sale_price'] = [120000, 90000, 130000, 70000]

# Replace each category with the mean of the target variable
target_means = df.groupby('city')['sale_price'].mean()
df['city_encoded'] = df['city'].map(target_means)

# Always use cross-validation folds to compute target means
# to prevent target leakage in training data
```
One-hot encoding is generally practical for columns with up to roughly 15–20 unique values. Beyond that, consider target encoding or embeddings. Target encoding can introduce leakage; always compute encoding means on out-of-fold data during cross-validation.
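The out-of-fold idea can be sketched as follows. This is a minimal illustration rather than a production encoder: the helper name `out_of_fold_target_encode`, the toy data, and the fallback to the fitting-fold mean for unseen categories are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data; column names and values are illustrative only
df = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Mumbai", "Chennai",
             "Delhi", "Mumbai", "Chennai", "Delhi"],
    "sale_price": [120.0, 90.0, 130.0, 70.0, 95.0, 125.0, 75.0, 85.0],
})

def out_of_fold_target_encode(frame, cat_col, target_col, n_splits=4, seed=0):
    """Encode each row using target means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(frame):
        fit_part = frame.iloc[fit_idx]
        fold_means = fit_part.groupby(cat_col)[target_col].mean()
        # Categories unseen in the fitting folds fall back to the fitting-fold mean
        fallback = fit_part[target_col].mean()
        encoded.iloc[enc_idx] = (
            frame.iloc[enc_idx][cat_col].map(fold_means).fillna(fallback).to_numpy()
        )
    return encoded

df["city_encoded"] = out_of_fold_target_encode(df, "city", "sale_price")
print(df)
```

Because no row contributes to its own encoding, the model cannot read the target back out of the encoded column. At prediction time, new rows would typically be encoded with means computed on the full training set.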
Technique 3: Log Transform for Skewed Distributions
Real-world numerical features like house prices, transaction amounts, page view counts, and salaries are very often right-skewed: a small number of extreme values stretch the distribution into a long tail. Linear models can fit poorly in this situation because a few extreme values dominate the fit and the residuals drift away from the roughly normal, constant-variance shape these models work best with. A log transform compresses the long tail and brings the distribution closer to normal.
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sale_price': [50000, 120000, 850000, 1200000, 75000]})

# log1p = log(x + 1); handles zero values safely
df['log_price'] = np.log1p(df['sale_price'])

# After prediction, reverse with expm1 to get original scale
# df['predicted_price'] = np.expm1(model.predict(X_test))
```
In the famous Kaggle House Prices competition, one of the most consistently impactful improvements across high-ranking solutions was not a complex model architecture, but a simple preprocessing step: applying a log transformation to the target variable SalePrice. This transformation helps stabilize variance, reduce skewness, and make the target distribution more Gaussian-like, which in turn improves model performance for many regression algorithms. The result highlights an important machine learning lesson: thoughtful data preprocessing can often contribute more to performance gains than switching to more advanced models.
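One way to apply a log transform to the target without manually reversing it is scikit-learn's `TransformedTargetRegressor`. The sketch below is not taken from any competition solution; the synthetic data and the choice of `LinearRegression` are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: a right-skewed, strictly positive target
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))  # e.g. floor area
y = np.exp(10 + 0.001 * X[:, 0] + rng.normal(0, 0.2, size=200))

# The wrapper fits the regressor on log1p(y) and applies expm1 to predictions
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)

preds = model.predict(X[:5])  # already back on the original scale
print(preds)
```

Keeping the transform and its inverse inside one estimator means the pair cannot drift apart, which is an easy mistake when `log1p` and `expm1` are applied by hand in different parts of a project.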
Technique 4: Extracting Features from Datetime Columns
A raw timestamp like 2024-11-15 14:32:00 is nearly useless to a model as a single value. But the information it contains — the hour of the day, the day of the week, whether it falls on a weekend, the month, the quarter, the number of days since a reference event — can be extraordinarily predictive. Datetime decomposition extracts all of this latent signal into separate numeric columns.
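```python
import pandas as pd

# Small illustrative frame so the snippet runs on its own
df = pd.DataFrame({'timestamp': ['2024-11-15 14:32:00', '2024-11-17 09:05:00']})

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek  # 0=Monday
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['month'] = df['timestamp'].dt.month
df['quarter'] = df['timestamp'].dt.quarter
# Uses the earliest timestamp as a stand-in for the reference (launch) date
df['days_since_launch'] = (df['timestamp'] - df['timestamp'].min()).dt.days
```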
Technique 5: Polynomial and Interaction Features
Linear models assume that the relationship between each input feature and the target is a straight line. When the true relationship is curved — or when two features combine to produce signal that neither carries alone — a linear model trained on raw features will underfit no matter how well you tune its regularisation.
```python
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

df = pd.DataFrame({'area': [800, 1200, 1500], 'rooms': [2, 3, 4]})

# Degree-2 polynomial: adds area², rooms², area×rooms
poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = pd.DataFrame(
    poly.fit_transform(df),
    columns=poly.get_feature_names_out(df.columns)
)

# Manual ratio feature (area per room); often more interpretable
df['area_per_room'] = df['area'] / df['rooms']
```
Use PolynomialFeatures with degree=2 as a starting point. With more than a handful of input features, degree 3 and higher quickly generate hundreds of columns and tend to overfit on small datasets. Manual interaction features, meaning domain-guided combinations like price per square foot or revenue per user, are often more effective than exhaustive polynomial expansion.
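The growth in column count can be checked directly. For n input features and degree d, the expansion without the bias term has C(n + d, d) − 1 columns. The sketch below compares that formula with what `PolynomialFeatures` reports; the helper name `n_poly_columns` and the choice of 10 input features are assumptions for illustration.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_columns(n_features, degree):
    """Number of output columns without the bias term: C(n + d, d) - 1."""
    return comb(n_features + degree, degree) - 1

X = np.zeros((1, 10))  # placeholder row with 10 input features
for degree in (2, 3, 4):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X)
    # The fitted transformer agrees with the closed-form count
    assert poly.n_output_features_ == n_poly_columns(10, degree)
    print(degree, poly.n_output_features_)
```

With 10 inputs, degree 2 yields 65 columns, degree 3 yields 285, and degree 4 yields 1,000, which is why higher degrees are rarely practical without strong regularisation or feature selection.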
Technique 6: Binning Continuous Variables
Some continuous features have a non-linear, threshold-based relationship with the target. Age, for example, may have a very different effect on loan default risk below 25, between 25 and 60, and above 60 — but the raw numeric value encodes none of these boundaries.
```python
import pandas as pd

df = pd.DataFrame({'age': [17, 25, 34, 52, 68, 71]})

# Custom bin edges chosen from domain knowledge (right edge inclusive)
df['age_bin'] = pd.cut(df['age'], bins=[0, 25, 45, 65, 100],
                       labels=['young', 'adult', 'middle', 'senior'])

# Quantile bins: roughly equal number of records per bucket
df['age_qbin'] = pd.qcut(df['age'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
```
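Quantile edges depend on the data they are computed from, so they should be learned on training rows and then reused. A minimal sketch, assuming a hypothetical train/test split of an age column:

```python
import pandas as pd

# Hypothetical train/test split of an age column
train_age = pd.Series([17, 25, 34, 52, 68, 71, 29, 45])
test_age = pd.Series([22, 40, 90])

# Learn quartile edges on the training data only
_, edges = pd.qcut(train_age, q=4, retbins=True)

# Widen the outer edges so unseen extremes still land in a bin
edges[0], edges[-1] = float("-inf"), float("inf")

labels = ["Q1", "Q2", "Q3", "Q4"]
train_bins = pd.cut(train_age, bins=edges, labels=labels)
test_bins = pd.cut(test_age, bins=edges, labels=labels)
print(test_bins.tolist())
```

Without the widened outer edges, a test value such as 90 would fall outside every training bin and come back as missing.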
Technique 7: Intelligent Missing Value Imputation
Missing values are not always random noise; the fact that a value is absent is often itself a signal. A missing income field on a loan application may indicate self-employment. A missing test result may indicate that a clinician judged the test unnecessary and never ordered it.
```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Small illustrative frame so the snippet runs on its own
df = pd.DataFrame({'income': [30000, np.nan, 60000, 90000, np.nan, 45000],
                   'age': [25, 40, 35, 52, 47, 31]})

# Step 1: Add a missingness indicator before imputing
df['income_missing'] = df['income'].isna().astype(int)

# Step 2a: Median imputation, robust to outliers
simple_imp = SimpleImputer(strategy='median')
df['income_imputed'] = simple_imp.fit_transform(df[['income']]).ravel()

# Step 2b: KNN imputation, uses similar rows to fill gaps
knn_imp = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(knn_imp.fit_transform(df[['income', 'age']]),
                      columns=['income', 'age'])
```
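scikit-learn can also produce the indicator and the imputed values in one step through the `add_indicator` parameter of `SimpleImputer`. The sketch below uses hypothetical toy data; the output column name `income_missing` is chosen here for readability and is not generated by the library.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with gaps in the income column
df = pd.DataFrame({
    "income": [30000.0, np.nan, 60000.0, 90000.0, np.nan],
    "age": [25, 40, 35, 52, 47],
})

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(df[["income", "age"]])

# Only income had gaps, so exactly one indicator column is appended
result = pd.DataFrame(filled, columns=["income", "age", "income_missing"])
print(result)
```

Bundling both steps in one transformer makes it easier to place imputation inside a pipeline, so the median is learned from training rows only.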
Feature Engineering Techniques: Quick Reference
| Technique | Input Type | What It Solves | Common Algorithms Helped |
| --- | --- | --- | --- |
| Normalisation / Scaling | Numerical | Features on different scales dominate distance-based models | KNN, SVM, Neural Networks |
| One-Hot Encoding | Categorical | Models cannot interpret string labels | Linear models, tree ensembles |
| Target Encoding | High-cardinality categorical | One-hot creates too many columns | Gradient boosting, XGBoost |
| Log Transform | Skewed numerical | Outliers distort mean-based models | Linear regression, GLMs |
| Polynomial Features | Numerical | Linear models cannot capture curved relationships | Linear/logistic regression |
| Date / Time Decomposition | Datetime | Raw timestamps carry no semantic signal | All model types |
| Interaction Features | Any | Combined signal stronger than individual features | Linear models, boosting |
Common Mistakes in Feature Engineering
1. Data leakage through improper scaling: Fitting a StandardScaler or computing target encoding means on the full dataset — including test rows — leaks future information into training. Always fit transformations on training data only, then apply them to validation and test sets.
2. One-hot encoding high-cardinality columns: Applying one-hot encoding to a column with 500 unique values creates 500 new binary columns. This increases memory use, slows training, and introduces the curse of dimensionality. Use target encoding or hashing for high-cardinality categoricals.
3. Dropping rather than engineering missing values: Dropping rows with missing values discards potentially predictive records. Adding a binary missingness indicator column and then imputing preserves the signal that absence itself carries — a frequently overlooked accuracy gain.
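The first mistake above is easiest to avoid by wrapping preprocessing and model in a single scikit-learn pipeline. The following is a minimal sketch on synthetic data; the dataset, the KNN classifier, and the fold count are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Inside cross_val_score, the pipeline refits the scaler on each training fold,
# so validation rows never influence the scaling statistics
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Calling `fit_transform` on the whole dataset before splitting would instead bake validation statistics into the training features, which is the leakage pattern described in the list.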
Conclusion
Feature engineering is often where the biggest machine learning accuracy gains come from. Scaling, encoding, and log transforms help models handle numerical and categorical data more effectively. Datetime, polynomial, and interaction features reveal hidden patterns and relationships in the data. Binning can simplify noisy continuous variables into meaningful, model-friendly categories.
FAQs
1. What is feature engineering in machine learning?
Feature engineering transforms raw data into meaningful features that better represent patterns in the dataset. It includes scaling, encoding, handling missing values, and creating new variables. Better features often lead to better model performance.
2. Why is feature engineering important?
Feature engineering helps models identify useful patterns more effectively. Well-engineered features can improve accuracy more than choosing a complex algorithm. It is especially important for structured tabular data.
3. What is the difference between normalisation and standardisation?
Normalisation scales values to a fixed range, usually 0 to 1. Standardisation transforms data to have zero mean and unit variance. Choose based on data distribution and model requirements.
4. When should I use one-hot encoding vs target encoding?
Use one-hot encoding for low-cardinality categorical features. Use target encoding for high-cardinality categories to avoid many columns. Apply target encoding carefully to prevent data leakage.
5. What is a log transform and when should I apply it?
A log transform reduces the impact of very large values. It is useful for right-skewed distributions with extreme outliers. Use log1p() when the feature contains zero values.
6. What is data leakage in feature engineering?
Data leakage occurs when training uses information from validation or test data. It produces unrealistically high evaluation scores. Prevent it by fitting transformations only on training data.
7. How do interaction features improve model accuracy?
Interaction features capture relationships between multiple variables. They reveal patterns that individual features may not show alone. Linear models benefit significantly from these engineered combinations.


