Hyperparameter Tuning: Techniques and Best Practices
Last Updated: Jun 03, 2026
A machine learning model is only as good as its configuration. The choice of algorithm matters, the quality of training data matters, but so does an often-underestimated third factor: the hyperparameters.
Hyperparameters are the settings that govern how a model learns. They are not learned from the data; the practitioner sets them before training begins. Choose them well, and your model converges faster, generalises better, and performs reliably on unseen data. Choose them poorly, and the same algorithm on the same data produces an overfit, underfit, or simply mediocre model.
Hyperparameter tuning is the systematic process of finding the configuration that maximises model performance. It is part science, part engineering, and part informed search, and it is one of the most practically important skills in machine learning.
This article explains what hyperparameters are, why they matter, and how the main tuning strategies (Grid Search, Randomised Search, Bayesian Optimisation, and AutoML) work in practice.
Table of contents
- TL;DR
- Hyperparameters vs. Parameters: A Critical Distinction
- Model Parameters: Learned from Data
- Hyperparameters: Set Before Training
- Why Hyperparameter Tuning Matters
- Overfitting and Underfitting
- The Impact on Production Performance
- Cross-Validation: The Foundation of Reliable Tuning
- Why a Single Validation Set Is Insufficient
- K-Fold Cross-Validation
- Stratified K-Fold
- Grid Search: Exhaustive Hyperparameter Search
- How GridSearchCV Works
- Strengths of Grid Search
- Limitations of Grid Search
- Randomised Search: Efficient Sampling
- How RandomizedSearchCV Works
- Choosing n_iter
- Bayesian Optimisation: Intelligent Search
- How Bayesian Optimisation Works
- Why Bayesian Optimisation Outperforms Random Search
- AutoML and Modern Tuning Frameworks
- Optuna
- Ray Tune
- H2O AutoML and Auto-sklearn
- Learning Rate Tuning and Practical Best Practices
- The Learning Rate Finder
- Learning Rate Scheduling
- Practical Tuning Best Practices
- Conclusion
- FAQs
- What is the difference between hyperparameters and parameters?
- When should I use Grid Search vs. Randomised Search?
- What makes Bayesian optimisation better than random search?
- What is cross-validation's role in hyperparameter tuning?
- Can AutoML replace manual hyperparameter tuning?
TL;DR
- Hyperparameters are configuration settings set before training that govern model structure and learning behaviour.
- Poor hyperparameter choices cause overfitting or underfitting; good ones produce models that generalise reliably.
- GridSearchCV exhaustively tests all combinations; RandomizedSearchCV samples a subset, which is faster and often equally effective.
- Bayesian optimisation uses past results to guide future searches, achieving better results with fewer evaluations.
- Tuning frameworks like Optuna and Ray Tune, and AutoML platforms like H2O AutoML, automate the tuning pipeline for production workflows.
What Is Hyperparameter Tuning?
Hyperparameter tuning is the process of finding the optimal values for a machine learning model’s hyperparameters—settings that are defined before training and control how the model learns. Unlike model parameters, which are learned from data during training, hyperparameters determine aspects such as model complexity, learning rate, and training behavior. By systematically evaluating different configurations, hyperparameter tuning helps improve model performance, balance the bias-variance trade-off, reduce overfitting, and enhance generalization to unseen data.
Hyperparameters vs. Parameters: A Critical Distinction
Before exploring tuning strategies, it is essential to understand exactly what a hyperparameter is and how it differs from the model parameters that training algorithms automatically optimise.
Model Parameters: Learned from Data
Model parameters are the values that a training algorithm adjusts to minimise the loss function during training. They are internal to the model and are determined by the data.
• Neural network weights and biases: Adjusted by backpropagation to minimise the training loss.
• Linear regression coefficients: Determined by the ordinary least squares solution or gradient descent.
• SVM support vectors and margins: Defined by the training algorithm based on the data distribution.
You do not set these directly; the training algorithm finds them for you. The goal of training is to find parameter values that minimise the loss on training data.
Hyperparameters: Set Before Training
Hyperparameters, by contrast, are configuration choices made before training begins. They govern how the model is structured and how training proceeds:
- Learning rate: Controls the step size in gradient descent. Too high: training diverges. Too low: training is extremely slow.
- Number of layers and neurons: Determines the capacity of a neural network.
- Max depth and n_estimators: Control the complexity of tree-based models like Random Forest and Gradient Boosting.
- Regularisation strength (C, lambda): Penalises model complexity to prevent overfitting.
- Batch size: The number of training examples used per gradient update step.
- Dropout rate: The fraction of neurons randomly disabled during training to prevent co-adaptation.
The central challenge is that there is no universal formula for setting hyperparameters optimally. Their ideal values depend on the dataset, the algorithm, the hardware, and the performance objective, which is why systematic tuning is necessary.
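The distinction can be made concrete with a minimal sketch. It assumes scikit-learn is installed and uses a synthetic dataset; the model choice (logistic regression) and the value C=0.5 are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: chosen by the practitioner before fit() is called.
model = LogisticRegression(C=0.5, max_iter=500)
hyperparameters = model.get_params()
has_coefficients_before_fit = hasattr(model, "coef_")  # nothing learned yet

# Parameters: learned from the data during fit().
model.fit(X, y)
learned_coefficients = model.coef_    # one weight per feature
learned_intercept = model.intercept_

print("C (set by us):", hyperparameters["C"])
print("coefficients (learned):", learned_coefficients.round(3))
```

The regularisation strength C is fixed when the estimator is constructed, while coef_ and intercept_ only exist once training has run.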
Why Hyperparameter Tuning Matters
Hyperparameter tuning is not an optional polish applied after a model already works. It is a critical determinant of whether the model works at all and how well it generalises.
Overfitting and Underfitting
The fundamental challenge in supervised learning is the bias-variance trade-off: a model must be complex enough to learn the patterns in the training data, but not so complex that it memorises noise instead of learning generalisable patterns.
- Underfitting (high bias): The model is too simple, often because hyperparameters like max depth or number of neurons are set too low. It performs poorly on both training and validation data.
- Overfitting (high variance): The model is too complex, often because regularisation is too weak, the model has too many parameters, or it has been trained for too many epochs. It performs well on training data but poorly on validation data.
Hyperparameter tuning navigates between these extremes. Regularisation hyperparameters control model complexity; depth and size hyperparameters control capacity; learning rate and epochs control how thoroughly the model fits the data. Getting them right is the difference between a model that generalises and one that does not.
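One way to see both failure modes is to sweep a single capacity hyperparameter and compare training and validation accuracy. The sketch below assumes scikit-learn and a synthetic, deliberately noisy dataset; the depth values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, so memorising the training set hurts.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for max_depth in [1, 3, 5, 10, None]:  # None lets the tree grow without limit
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    val_acc = tree.score(X_val, y_val)
    results[max_depth] = (train_acc, val_acc)
    print(f"max_depth={max_depth}: train={train_acc:.3f}, val={val_acc:.3f}")
```

A very shallow tree tends to score poorly on both sets (underfitting), while an unlimited tree fits the training set perfectly but loses accuracy on the validation set (overfitting).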
The Impact on Production Performance
In research benchmarks, hyperparameter differences can account for accuracy differences of 5–15% on the same dataset with the same algorithm. In production, these differences translate directly into business outcomes: a fraud detection model with 3% better AUC catches meaningfully more fraud; a recommendation model with 5% better precision drives measurably higher engagement. Tuning is not theoretical; its effects are economically significant.
Research published in the Journal of Machine Learning Research (JMLR) highlighted an important lesson for AI practitioners: carefully tuned hyperparameters can sometimes deliver performance improvements comparable to entirely new algorithmic innovations. Parameters such as learning rates, regularization strength, batch size, and optimizer settings often have a major impact on model quality. The findings reinforced the idea that successful machine learning depends not only on choosing the right model architecture, but also on systematic hyperparameter optimization, experimentation, and reproducible training practices.
Cross-Validation: The Foundation of Reliable Tuning
Before exploring tuning strategies, it is essential to understand cross-validation, because the quality of hyperparameter tuning depends entirely on the quality of performance estimation.
Why a Single Validation Set Is Insufficient
If you evaluate hyperparameter configurations on a fixed validation set, you risk overfitting to that specific validation set. The configuration that scores best may have been lucky, producing good results by chance on that particular sample rather than because it genuinely generalises better.
Cross-validation addresses this by evaluating each hyperparameter configuration across multiple validation folds, producing a more robust and reliable performance estimate.
K-Fold Cross-Validation
K-fold cross-validation splits the training data into k equal folds. For each configuration, the model is trained k times, each time using k-1 folds for training and the remaining fold for validation. The average validation score across all k folds is the configuration’s cross-validation score.
Common choices for k are 5 and 10. Larger k means each model trains on a larger share of the data, so the estimate is less pessimistically biased, but it requires more computation. Smaller k is faster, but each model sees less data, which tends to understate performance. For small datasets, k=10 is typical; for large datasets, k=3 or k=5 may be sufficient.
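A minimal sketch of 5-fold cross-validation with scikit-learn follows. The dataset is synthetic and the estimator is an arbitrary choice made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Five folds: each model trains on 240 samples and validates on 60.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in cv.split(X)]

# One score per fold; their mean is the cross-validation score.
scores = cross_val_score(LogisticRegression(max_iter=500), X, y, cv=cv)

print("validation fold sizes:", fold_sizes)
print("per-fold accuracy:", scores.round(3))
print("cross-validation score:", round(float(scores.mean()), 3))
```

The spread of the per-fold scores is worth inspecting alongside the mean, since a large spread suggests the estimate is sensitive to how the data was split.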
Stratified K-Fold
For classification problems with class imbalance, stratified k-fold ensures that each fold contains approximately the same proportion of each class as the full dataset. This prevents folds where a minority class is absent from the validation set entirely, which would produce unreliable performance estimates for that class.
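The effect is easy to demonstrate on a toy label vector. The sketch below assumes scikit-learn and NumPy, and uses 100 samples with a 10% minority class, sorted by label to make the problem with plain k-fold obvious.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 90 majority-class samples followed by 10 minority-class samples.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # feature values do not affect how the splitters work

plain = KFold(n_splits=5)            # unshuffled: contiguous blocks of 20
stratified = StratifiedKFold(n_splits=5)

# Number of minority-class samples in each validation fold.
plain_counts = [int(y[val].sum()) for _, val in plain.split(X)]
strat_counts = [int(y[val].sum()) for _, val in stratified.split(X, y)]

print("plain k-fold:     ", plain_counts)   # [0, 0, 0, 0, 10]
print("stratified k-fold:", strat_counts)   # [2, 2, 2, 2, 2]
```

With plain k-fold, four of the five validation folds contain no minority samples at all; stratification gives every fold the same 10% share.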
Grid Search: Exhaustive Hyperparameter Search
Grid search is the simplest and most intuitive hyperparameter tuning strategy. It exhaustively evaluates every possible combination of hyperparameter values in a predefined grid.
How GridSearchCV Works
In scikit-learn, GridSearchCV automates the entire grid search process. The practitioner defines a parameter grid as a dictionary mapping hyperparameter names to lists of candidate values. GridSearchCV trains and evaluates the model for every combination of values in the grid using cross-validation, then returns the combination with the highest cross-validation score.
For a Random Forest with a grid of {n_estimators: [100, 200, 300], max_depth: [5, 10, 15], min_samples_split: [2, 5, 10]}, GridSearchCV evaluates 3 × 3 × 3 = 27 combinations. With 5-fold cross-validation, this requires 27 × 5 = 135 model training runs.
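A minimal sketch of this search follows, assuming scikit-learn and a synthetic dataset. The grid has the same 3 × 3 × 3 shape as the one above, but n_estimators is scaled down (an assumption made here) so that the example finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Same shape as the grid in the text; smaller forests keep the run short.
param_grid = {
    "n_estimators": [10, 20, 30],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
}
n_combinations = len(ParameterGrid(param_grid))  # 3 * 3 * 3 = 27

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)

n_fits = n_combinations * search.n_splits_  # 27 * 5 = 135 training runs
print("combinations:", n_combinations, "fits:", n_fits)
print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

By default GridSearchCV also refits the best configuration on the full training data once the search is done, so search can be used directly for prediction afterwards.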
Strengths of Grid Search
- Exhaustive coverage: Grid search guarantees that every specified combination is evaluated, so the best-scoring configuration within the defined grid cannot be missed.
- Reproducibility: Because the search space is completely defined upfront, results are fully reproducible with the same random seed.
- Simplicity: Grid search is easy to implement and interpret; the parameter grid is explicit and transparent.
Limitations of Grid Search
- Exponential scaling: Adding one more hyperparameter or one more value per parameter multiplies the total number of evaluations. For 6 hyperparameters each with 5 values, the grid has 5^6 = 15,625 combinations, which is computationally prohibitive for most models.
- Inefficient exploration: Grid search allocates equal computational budget to each combination, even to regions of the parameter space that are clearly suboptimal after only a few evaluations.
- Continuous parameters require discretisation: Learning rate, for example, is a continuous value. Grid search requires specifying discrete values, potentially missing the true optimum between grid points.
Grid search is best suited for small parameter spaces (2–3 hyperparameters, 3–5 values each) where exhaustive evaluation is computationally feasible.
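The scaling arithmetic can be checked directly. In this small sketch, the one-minute cost per training run is a hypothetical figure chosen only to make the totals tangible.

```python
from math import prod

def grid_size(values_per_hyperparameter):
    """Number of combinations in a full grid."""
    return prod(values_per_hyperparameter)

small_grid = grid_size([3, 3, 3])   # 27 combinations
large_grid = grid_size([5] * 6)     # 5^6 = 15,625 combinations
fits_with_5_fold = large_grid * 5   # 78,125 training runs

# Hypothetical cost: one minute per training run.
minutes_per_fit = 1
days = fits_with_5_fold * minutes_per_fit / 60 / 24

print(small_grid, large_grid, fits_with_5_fold)
print(f"about {days:.1f} days of sequential compute")
```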
Randomised Search: Efficient Sampling
Randomised search, introduced by Bergstra and Bengio in their landmark 2012 paper, addresses the exponential scaling problem of grid search by sampling a fixed number of hyperparameter configurations at random from a defined distribution rather than exhaustively testing every combination.
How RandomizedSearchCV Works
In scikit-learn, RandomizedSearchCV accepts parameter distributions rather than only discrete lists, such as scipy.stats.loguniform (continuous) and scipy.stats.randint (discrete integers). It samples n_iter configurations randomly from these distributions, evaluates each with cross-validation, and returns the best-performing configuration.
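A minimal sketch, assuming scikit-learn and SciPy are installed. The estimator, the ranges, and n_iter=10 are illustrative assumptions kept small so that the example runs quickly.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_distributions = {
    "learning_rate": loguniform(1e-3, 1e0),  # continuous, sampled on a log scale
    "n_estimators": randint(20, 60),         # integers in [20, 60)
    "max_depth": randint(2, 6),              # integers in [2, 6)
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=10,          # number of sampled configurations
    cv=3,
    random_state=0,     # makes the sampled configurations reproducible
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

The cost is n_iter × cv training runs regardless of how many hyperparameters are included, which is what makes the budget predictable.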
The key insight from Bergstra and Bengio’s research: if only a few hyperparameters significantly affect performance (often one or two dominate), then random search tries more distinct values of those important parameters than grid search does, because random search is not constrained to a regular grid pattern.
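A toy illustration of this point, using only the standard library: with a budget of nine evaluations over two hyperparameters, a 3 × 3 grid tries just three distinct values of the important one, while nine random draws try nine. The value ranges are arbitrary.

```python
import itertools
import random

random.seed(0)

# Budget: 9 evaluations over two hyperparameters, each scaled to [0, 1].
grid_values = [0.0, 0.5, 1.0]
grid_points = list(itertools.product(grid_values, grid_values))
random_points = [(random.random(), random.random()) for _ in range(9)]

# Suppose only the first hyperparameter actually matters.
distinct_grid = len({important for important, _ in grid_points})
distinct_random = len({important for important, _ in random_points})

print("distinct values tried (grid):  ", distinct_grid)    # 3
print("distinct values tried (random):", distinct_random)  # 9
```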
Choosing n_iter
The primary decision in randomised search is n_iter, the number of configurations to evaluate. A practical starting point is 20–60 iterations for most problems. More iterations improve the chance of finding the optimum, but increase computation time linearly. The right choice depends on the available compute budget and the sensitivity of performance to hyperparameter choice.
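One common back-of-the-envelope argument for this range can be written out directly. It assumes configurations are drawn independently, and that "good" means landing in the best 5% of the search space as measured by the sampling distribution; both are simplifying assumptions.

```python
import math

def prob_hit_top_fraction(n_iter, top_fraction=0.05):
    """Chance that at least one of n_iter random draws is in the top fraction."""
    return 1.0 - (1.0 - top_fraction) ** n_iter

def n_iter_for(confidence, top_fraction=0.05):
    """Smallest n_iter that reaches the requested confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))

for n in (20, 40, 60):
    print(f"n_iter={n}: P(hit top 5%) = {prob_hit_top_fraction(n):.3f}")

print("n_iter for 95% confidence:", n_iter_for(0.95))  # 59
```

Under these assumptions, about 60 draws give roughly a 95% chance of sampling at least one configuration from the best 5% of the space.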
Bayesian Optimisation: Intelligent Search
Both grid search and randomised search treat each hyperparameter evaluation independently; they do not use the results of previous evaluations to guide where to search next. Bayesian optimisation takes a fundamentally different approach: it learns from every evaluation to build a model of the performance surface and uses that model to decide where to evaluate next.
How Bayesian Optimisation Works
Bayesian optimisation maintains a probabilistic surrogate model, typically a Gaussian Process or Tree-structured Parzen Estimator, that approximates the relationship between hyperparameter values and model performance. At each iteration:
- Evaluate the current best guess: Train and evaluate the model with the hyperparameter configuration suggested by the acquisition function.
- Update the surrogate model: Incorporate the new result into the surrogate, refining its estimate of the performance surface.
- Select the next configuration: Use an acquisition function to identify the hyperparameter configuration most likely to improve on the current best — balancing exploration (regions of high uncertainty) with exploitation (regions of known high performance).
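The loop above can be sketched in a few lines. This is a simplified illustration, not a production optimiser: it assumes scikit-learn and SciPy, uses a Gaussian Process surrogate with an expected-improvement acquisition function, and replaces real model training with a hypothetical one-dimensional objective over log10 of the learning rate.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(log_lr):
    """Hypothetical validation score, peaking at log10(lr) = -2."""
    return -(log_lr + 2.0) ** 2

def expected_improvement(candidates, gp, best_score, xi=0.01):
    """Acquisition function: trades off high predicted mean and high uncertainty."""
    mean, std = gp.predict(candidates, return_std=True)
    std = np.maximum(std, 1e-9)  # avoid division by zero
    improvement = mean - best_score - xi
    z = improvement / std
    return improvement * norm.cdf(z) + std * norm.pdf(z)

rng = np.random.default_rng(0)
candidates = np.linspace(-5.0, 0.0, 501).reshape(-1, 1)

# Start from a few random evaluations.
X_seen = rng.uniform(-5.0, 0.0, size=(3, 1))
y_seen = np.array([objective(x[0]) for x in X_seen])
initial_best = y_seen.max()

for _ in range(12):
    # Update the surrogate with everything observed so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True, alpha=1e-6)
    gp.fit(X_seen, y_seen)
    # Select the candidate that maximises the acquisition function.
    ei = expected_improvement(candidates, gp, y_seen.max())
    next_x = candidates[np.argmax(ei)]
    # Evaluate it and add the result to the history.
    X_seen = np.vstack([X_seen, next_x])
    y_seen = np.append(y_seen, objective(next_x[0]))

best_log_lr = X_seen[np.argmax(y_seen), 0]
print("best log10(lr) found:", round(float(best_log_lr), 3))
```

In a real setting, objective would train the model and return a cross-validation score, and the candidate set would cover several hyperparameters rather than one.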
Why Bayesian Optimisation Outperforms Random Search
Because Bayesian optimisation uses all previous evaluations to inform each new one, it focuses computational effort on the most promising regions of the hyperparameter space. In practice, Bayesian optimisation typically finds configurations comparable to those from extensive random or grid search in 3–5 times fewer evaluations, a decisive advantage for expensive models (deep neural networks, large ensemble methods) where each evaluation takes hours.
AutoML and Modern Tuning Frameworks
Modern hyperparameter tuning has moved well beyond manual grid and random search. A mature ecosystem of frameworks automates and accelerates the tuning pipeline for production machine learning workflows.
Optuna
Optuna is a Python-native hyperparameter optimisation framework built around a define-by-run API, where hyperparameter sampling is integrated directly into the training function rather than defined upfront as a static grid. Optuna uses the Tree-structured Parzen Estimator (TPE) for Bayesian optimisation by default and supports advanced features, including:
• Pruning: terminating unpromising trials early to save compute.
• Distributed optimisation: parallelising trials across multiple machines.
• Multi-objective optimisation: simultaneously optimising multiple metrics.
Optuna is one of the most widely adopted tuning frameworks in the Python ML community, compatible with any training framework, including scikit-learn, PyTorch, XGBoost, and LightGBM.
Ray Tune
Ray Tune is a distributed hyperparameter tuning library built on Ray, designed for large-scale experiments. It supports the major search algorithms (grid, random, Bayesian, population-based training) and schedulers (Hyperband, ASHA), and is designed to scale from a single machine to a large cluster with minimal code changes. Ray Tune is a common choice for deep learning hyperparameter tuning, where individual trials may take hours and parallelisation across GPU clusters is required.
H2O AutoML and Auto-sklearn
Full AutoML platforms go beyond hyperparameter search to automate the entire model selection and tuning pipeline:
- H2O AutoML: Automatically trains and tunes multiple algorithm families (GBM, Random Forest, neural networks, GLM, stacked ensembles) and returns a leaderboard of best-performing models with their hyperparameter configurations.
- Auto-sklearn: Built on scikit-learn, Auto-sklearn uses Bayesian optimisation to search over both algorithm selection and hyperparameter configuration simultaneously, finding the best pipeline from preprocessing through model selection and tuning in a single automated workflow.
Learning Rate Tuning and Practical Best Practices
Among all hyperparameters, the learning rate often has the greatest impact on model performance in gradient-based training. Understanding how to tune it efficiently, along with the broader best practices of hyperparameter tuning, separates good practitioners from great ones.
The Learning Rate Finder
The learning rate finder, popularised by fast.ai, identifies a good initial learning rate by running a short training run with an exponentially increasing learning rate. A good learning rate is approximately where the training loss decreases most steeply, somewhat below the point at which the loss starts to increase. This technique avoids the need to search for learning rates blindly and provides a data-driven starting point for more refined tuning.
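The idea can be sketched on a toy problem. This is a simplified illustration rather than fast.ai's implementation: it assumes NumPy, uses full-batch gradient descent on synthetic linear regression data, and stops once the loss exceeds four times its best value (a threshold chosen here for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=512)

def lr_range_test(X, y, min_lr=1e-5, max_lr=10.0, n_steps=100):
    """Take one gradient step per learning rate, increasing it exponentially."""
    w = np.zeros(X.shape[1])
    tried, losses = [], []
    best = np.inf
    for lr in np.geomspace(min_lr, max_lr, n_steps):
        error = X @ w - y
        loss = float(np.mean(error ** 2))
        if not np.isfinite(loss) or loss > 4.0 * best:
            break  # the loss has blown up: stop the test
        best = min(best, loss)
        tried.append(lr)
        losses.append(loss)
        w -= lr * (2.0 * X.T @ error / len(X))  # gradient of the mean squared error
    return np.array(tried), np.array(losses)

lrs, losses = lr_range_test(X, y)

# Slope of the loss with respect to log10(lr); the most negative slope marks
# the learning rate at which the loss was falling fastest.
slopes = np.gradient(losses, np.log10(lrs))
suggested_lr = lrs[np.argmin(slopes)]

print("learning rates tried:", len(lrs))
print("suggested starting learning rate:", float(suggested_lr))
```

With minibatch training the recorded losses are noisy, so practical implementations usually smooth the loss curve before looking for the steepest point.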
Learning Rate Scheduling
A fixed learning rate is rarely optimal throughout training. Learning rate scheduling reduces the learning rate over time, allowing the model to make large initial updates and fine-grained adjustments later. Common schedules include:
- Step decay: Reduces the learning rate by a fixed factor every N epochs.
- Cosine annealing: Smoothly reduces the learning rate following a cosine curve, often with warm restarts.
- Cyclical learning rates: Oscillate between a minimum and maximum learning rate, allowing the model to escape local minima.
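The three schedules can be written as plain functions of the epoch or step. This is a framework-independent sketch; the default values (base rate 0.1, halving every 10 epochs, and so on) are illustrative assumptions, and warm restarts are not shown.

```python
import math

def step_decay(epoch, base_lr=0.1, drop=0.5, every=10):
    """Multiply the learning rate by `drop` every `every` epochs."""
    return base_lr * drop ** (epoch // every)

def cosine_annealing(epoch, total_epochs, base_lr=0.1, min_lr=0.0):
    """Decay smoothly from base_lr to min_lr along half a cosine wave."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine

def triangular_cyclical(step, min_lr=0.001, max_lr=0.1, half_cycle=5):
    """Rise linearly from min_lr to max_lr, fall back, and repeat."""
    cycle_position = step % (2 * half_cycle)
    distance_from_peak = abs(cycle_position - half_cycle) / half_cycle
    return min_lr + (max_lr - min_lr) * (1.0 - distance_from_peak)

for epoch in (0, 10, 25, 50, 100):
    print(
        f"epoch {epoch:3d}: "
        f"step={step_decay(epoch):.5f} "
        f"cosine={cosine_annealing(epoch, 100):.5f} "
        f"cyclical={triangular_cyclical(epoch):.5f}"
    )
```

Deep learning frameworks ship their own scheduler classes; writing the formulas out mainly helps when deciding which schedule parameters are worth tuning.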
Practical Tuning Best Practices
- Start with algorithm defaults: Default hyperparameters are often well-chosen by algorithm authors. Understand why the defaults exist before changing them.
- Tune the most impactful hyperparameters first: For most algorithms, 1–2 hyperparameters account for the majority of performance variation. For tree models, these are typically n_estimators and max_depth; for neural networks, learning rate and architecture size.
- Use logarithmic scales for scale-sensitive parameters: Learning rate and regularisation strength should be searched on a log scale (e.g., 1e-4 to 1e-1) rather than linearly.
- Log all experiments: Use MLflow, Weights & Biases, or Optuna’s built-in logging to record every trial’s hyperparameters and results. Experiment tracking makes it possible to reproduce results and learn from patterns across trials.
- Reserve a test set: Never use the test set during tuning. It must remain unseen until final evaluation; otherwise, repeated evaluation on the test set effectively turns it into a validation set, producing optimistic performance estimates.
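The advice on logarithmic scales is easy to verify numerically. The sketch below uses only the standard library and the 1e-4 to 1e-1 range mentioned above; the sample size and the 1e-3 threshold are arbitrary choices.

```python
import math
import random

random.seed(0)
low, high = 1e-4, 1e-1
n = 10_000

# Linear sampling: almost every draw lands in the top decade.
linear = [random.uniform(low, high) for _ in range(n)]

# Log-scale sampling: draw the exponent uniformly, then exponentiate.
log_scale = [10 ** random.uniform(math.log10(low), math.log10(high)) for _ in range(n)]

def share_below(samples, threshold=1e-3):
    """Fraction of samples smaller than the threshold."""
    return sum(s < threshold for s in samples) / len(samples)

print(f"linear:    {share_below(linear):.1%} of samples below 1e-3")
print(f"log scale: {share_below(log_scale):.1%} of samples below 1e-3")
```

Linear sampling places under 1% of its draws in the decade from 1e-4 to 1e-3, while log-scale sampling gives each of the three decades roughly a third of the budget.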
Conclusion
Hyperparameter tuning is one of the highest-leverage activities in machine learning. The same algorithm, the same data, and the same hardware can produce dramatically different results depending on how the hyperparameters are configured — and the difference between a mediocre model and a production-ready one often comes down to this single step.
The toolbox is well-developed. Grid search provides exhaustive coverage for small parameter spaces. Randomised search scales efficiently to larger spaces and continuous parameters. Bayesian optimisation directs the search intelligently, finding strong configurations with far fewer evaluations than either search alternative. Modern frameworks like Optuna, Ray Tune, and H2O AutoML bring these capabilities to production workflows with parallelisation, pruning, and full experiment tracking.
The practitioner who understands not just how these techniques work but when to apply each one, and who pairs systematic tuning with reliable cross-validation and disciplined experiment tracking, consistently builds models that perform better, generalise further, and hold up under the pressures of real-world deployment.
FAQs
1. What is the difference between hyperparameters and parameters?
Parameters, such as neural network weights or linear regression coefficients, are learned by the training algorithm from data. Hyperparameters are configuration settings defined by the practitioner before training begins (such as learning rate, max depth, or regularisation strength) that govern how the model is structured and how training proceeds.
2. When should I use Grid Search vs. Randomised Search?
Use Grid Search when the hyperparameter space is small (2–3 parameters, 3–5 values each) and exhaustive evaluation is computationally feasible. Use Randomised Search when the space is larger or includes continuous parameters; it achieves comparable results with a fixed, predictable computational budget regardless of search space size.
3. What makes Bayesian optimisation better than random search?
Bayesian optimisation uses all previous trial results to build a probabilistic model of the performance surface and selects the next configuration intelligently, focusing on promising regions. This means it typically finds better configurations in fewer evaluations than random search, making it far more efficient for expensive models.
4. What is cross-validation’s role in hyperparameter tuning?
Cross-validation provides a reliable, low-variance estimate of how well a hyperparameter configuration will perform on unseen data. Without cross-validation, evaluating configurations on a single fixed validation set risks overfitting the hyperparameters to that specific sample, producing overly optimistic performance estimates.
5. Can AutoML replace manual hyperparameter tuning?
AutoML platforms automate the search process effectively and often match expert-level tuning on standard tasks. However, they require compute budgets, can be opaque in their decisions, and may not capture domain-specific constraints. Manual tuning remains valuable for understanding model behaviour and imposing task-specific requirements.


