Research2024.04.20

Genetic Algorithms for Hyperparameter Tuning in Machine Learning

Genetic Algorithms · Machine Learning · Hyperparameter Tuning

The hyperparameter optimization problem

Most machine learning models expose settings that are not learned from data during training but are fixed before it: the learning rate and weight decay of a neural network, the depth and minimum leaf size of a decision tree, the kernel and regularization constant of a support vector machine. These hyperparameters interact, and the mapping from a configuration to validation performance has no closed form. Hyperparameter optimization (HPO) is the problem of finding the configuration that minimizes validation loss, and it is hard for reasons that are structural rather than incidental.

The objective is an expensive black box. Each evaluation means training a model to convergence, which can take minutes for a small classifier or days for a large network, so the search budget is measured in tens or hundreds of evaluations rather than millions. The objective is non-differentiable with respect to the hyperparameters, ruling out gradient descent over the search space directly. The space itself is awkward: it mixes continuous parameters (learning rate), integer ones (number of layers), and categorical ones (optimizer choice), and it is often conditional, where a parameter only exists if another takes a particular value — the momentum coefficient is meaningless unless the optimizer is SGD with momentum. Finally, the objective is noisy: re-running the same configuration with a different random seed gives a different validation score, so the function being optimized is a stochastic one.

Baselines: grid, random, and Bayesian search

The default method is grid search: discretize each hyperparameter and evaluate every combination. It is trivial to implement and to parallelize, but the number of evaluations grows exponentially with the number of hyperparameters, and it spends its budget evenly across dimensions regardless of which ones matter.

Bergstra and Bengio's 2012 analysis is the standard reference for why this is wasteful. They showed, both empirically and through a simple argument about effective dimensionality, that random search over the same domain finds models as good as or better than grid search within a small fraction of the computation. The intuition is that for most problems only a few hyperparameters meaningfully affect performance; a grid wastes trials by repeating the same values of the important parameters while varying unimportant ones, whereas random sampling gives each important parameter many distinct values for the same budget. Random search is the honest baseline that any more sophisticated method should be measured against.

Bayesian optimization is the most common model-based alternative. It fits a probabilistic surrogate — classically a Gaussian process, as in Snoek, Larochelle, and Adams's 2012 treatment — to the observations seen so far, then uses an acquisition function to choose the next configuration that best trades off exploiting promising regions against exploring uncertain ones. It is sample-efficient, often finding good configurations in far fewer evaluations than random search, but it is largely sequential by construction, and standard Gaussian process surrogates handle continuous spaces more naturally than conditional or high-cardinality categorical ones.

How genetic algorithms work

A genetic algorithm (GA) borrows its structure from natural selection. A configuration is encoded as a genome — a vector or tree whose components, the genes, are the individual hyperparameters. The algorithm maintains a population of such genomes and improves it across generations. Fitness is simply the validation metric obtained by training a model with that configuration; higher fitness means the individual is more likely to contribute to the next generation.

Each generation proceeds in a few steps. Selection draws parents in proportion to fitness, commonly through tournament selection, where a few individuals are sampled at random and the best of them is chosen. Crossover recombines two parents into a child by mixing their genes, so a child might inherit one parent's learning rate and the other's network depth. Mutation then perturbs some genes at random — nudging a continuous value, flipping a categorical choice — which injects the variation that lets the search reach regions no parent occupied. Elitism carries the best individuals through unchanged so that good solutions are never lost to unlucky recombination. The new population is evaluated, and the loop repeats until the budget is spent. In prose this is the whole algorithm: initialize a population at random, evaluate fitness, select parents, apply crossover and mutation to produce offspring, keep the elite, and iterate.

I used the DEAP library for this when I was doing this work as an undergraduate; it exposes selection, crossover, and mutation as composable operators rather than hiding them inside a black box, which makes the encoding decisions explicit. The encoding and the mutation schedule are where most of the design effort goes. Mutate too aggressively and the method degenerates into random search; mutate too timidly and the population converges prematurely into the basin it started in.

Why genetic algorithms fit HPO

The properties that make HPO hard are the ones GAs tolerate well. Because fitness is evaluated by training a model and reading off a number, the method never needs a gradient, so non-differentiability is a non-issue. The genome can hold continuous, integer, and categorical genes side by side, and conditional structure is handled by encoding it directly — a tree representation can omit a subtree when the parameter that gates it is inactive — so mixed and conditional spaces do not require special surrogate machinery.

Two further properties matter in practice. A GA evaluates a whole population each generation, and those evaluations are independent, so the expensive part of the search is embarrassingly parallel: with enough machines, a generation of fifty configurations costs about as much wall-clock time as one. And because the search is population-based rather than following a single trajectory, it explores several regions at once, which makes it less prone to getting stuck in a single local optimum than a method that commits early to one promising area.

Evidence from real systems

Evolutionary methods are not a toy; they have produced competitive results on hard problems. The clearest example is neural architecture search. Real, Aggarwal, Huang, and Le's 2019 "regularized evolution" introduced an aging variant of tournament selection — older individuals are retired regardless of fitness — and evolved AmoebaNet-A, which when scaled up reached 83.9% top-1 and 96.6% top-5 accuracy on ImageNet, the first time an evolved image classifier surpassed hand-designed networks at that scale. The same paper found that evolution reached good architectures at least as fast as reinforcement learning in their controlled comparison.

Closer to everyday HPO, the TPOT system of Olson and Moore uses genetic programming to evolve entire scikit-learn pipelines — feature preprocessors, model choices, and their hyperparameters together — represented as trees, and automatically discovers pipelines that improve on a basic baseline analysis without manual tuning. Loshchilov and Hutter applied CMA-ES, an evolution strategy that adapts a covariance matrix over the search distribution, to tune a convolutional network on MNIST across 30 GPUs in parallel, using the method's natural parallelism to compete with Bayesian optimization. These systems differ in their operators, but they share the evolutionary template: a population, a fitness signal, and variation under selection.

Trade-offs against Bayesian and multi-fidelity methods

The honest comparison is with Bayesian optimization and with bandit-based multi-fidelity methods, and the trade-off is roughly sample efficiency against parallelism. Bayesian optimization tends to find good configurations in fewer total evaluations because each choice is informed by a model of everything seen so far, but that same dependence makes it sequential and slower to spread across many workers. A GA is less sample-efficient — it learns nothing explicit between generations beyond which individuals survived — but it parallelizes cleanly, so when evaluations are cheap to run concurrently and machines are plentiful, wall-clock time can favor the GA even if it uses more total compute.

Multi-fidelity methods attack the cost of evaluation itself. Hyperband, from Li, Jamieson, DeSalvo, Rostamizadeh, and Talwalkar, treats HPO as a bandit problem: it allocates a small budget to many random configurations, keeps the best, and progressively gives survivors more resources through successive halving, reporting more than an order-of-magnitude speedups over competitors on several problems. BOHB, from Falkner, Klein, and Hutter, replaces Hyperband's random sampling with a Bayesian model so the method has both strong anytime behavior and fast final convergence. These ideas are not in opposition to evolution; the same early-stopping and partial-budget evaluation can be folded into a GA, so that cheap, short training runs filter the population before expensive full evaluations are spent on survivors. The pragmatic reading is that GAs win when the space is awkward and parallelism is abundant, and lose when the evaluation budget is tight and the space is well-behaved enough for a surrogate model to learn quickly.

Practical guidance

A few choices carry most of the weight. Population size sets the breadth of exploration per generation and the granularity of parallelism; a population well below the search space's dimensionality converges too early, while a very large one wastes evaluations. Mutation rate governs the balance between exploration and exploitation, and a schedule that starts higher and decays as the population converges tends to behave better than a fixed rate. Elitism of a small fraction protects the best solutions without freezing the population. The total evaluation budget — population size times generations — should be set against what random search would achieve with the same number of trials, since that is the baseline the method has to beat.

Two failure modes deserve attention. Fitness is noisy because of seed variance, so a configuration can look better than it is by luck; evaluating promising individuals over a few seeds, or averaging folds, keeps the selection pressure pointed at real signal rather than noise. And full fitness evaluations are expensive, so fitness approximation — training for fewer epochs, on a data subset, or with early stopping — lets the population be filtered cheaply, with full-budget evaluation reserved for the candidates that survive. The encoding is worth designing deliberately: representing conditional parameters so that inactive genes are genuinely inert keeps crossover and mutation from spending effort on settings that do nothing.

Conclusion

Genetic algorithms are a pragmatic choice for hyperparameter optimization when the search space is mixed, conditional, or high-dimensional in ways that frustrate surrogate models, and when evaluations can be run in parallel across many machines. They are gradient-free, they handle discrete and conditional structure without special machinery, and their population-based search resists premature convergence. They are not the right tool when the evaluation budget is small and the space is smooth enough that Bayesian optimization can exploit a model of it, or when a multi-fidelity method can cut evaluation cost more directly. In practice the strongest setups borrow from both worlds — combining evolution's parallel, structure-tolerant search with early-stopping and partial-budget evaluation — rather than treating the choice as either-or.

References

01Random Search for Hyper-Parameter Optimization — Bergstra, J. & Bengio, Y. (2012). Journal of Machine Learning Research 13:281-305
02Practical Bayesian Optimization of Machine Learning Algorithms — Snoek, J., Larochelle, H. & Adams, R. P. (2012). Advances in Neural Information Processing Systems 25 (NeurIPS)
03Regularized Evolution for Image Classifier Architecture Search — Real, E., Aggarwal, A., Huang, Y. & Le, Q. V. (2019). Proceedings of the AAAI Conference on Artificial Intelligence
04TPOT: A Tree-based Pipeline Optimization Tool for Automating Machine Learning — Olson, R. S. & Moore, J. H. (2016). ICML 2016 AutoML Workshop, PMLR 64:66-74
05CMA-ES for Hyperparameter Optimization of Deep Neural Networks — Loshchilov, I. & Hutter, F. (2016). arXiv:1604.07269 (ICLR 2016 Workshop Track)
06Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization — Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A. & Talwalkar, A. (2017). Journal of Machine Learning Research 18:1-52
07BOHB: Robust and Efficient Hyperparameter Optimization at Scale — Falkner, S., Klein, A. & Hutter, F. (2018). Proceedings of the 35th ICML, PMLR 80:1436-1445
08DEAP: Evolutionary Algorithms Made Easy — Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A., Parizeau, M. & Gagné, C. (2012). Journal of Machine Learning Research 13:2171-2175

← All writing