A tour of AutoML

Marc Deveaux
7 min read · Mar 6, 2023
Photo by Stas Parechyn on Unsplash

Main source

“Automated machine learning: Review of the state-of-the-art and opportunities for healthcare” by Jonathan Waring, Charlotta Lindvall, and Renato Umeton

https://www.researchgate.net/publication/339417976_Automated_Machine_Learning_Review_of_the_State-of-the-Art_and_Opportunities_for_Healthcare

Background

  • Due to the “no free lunch” theorem, a large effort from human experts is needed to test different algorithms in order to achieve good performance
  • Depending on the industry, data and human expertise are not always readily available, which makes ML solution deployment difficult
  • AutoML’s goal is to automatically optimize one or more components of an ML pipeline
  • AutoML can be split into 4 distinct categories: automated feature engineering, hyperparameter optimization, pipeline optimizers (addresses more than one component), and neural architecture search

Automated feature engineering

Feature engineering is one of the most time-consuming tasks in building an ML solution. It requires domain knowledge to be done well and has a strong impact on the final model’s performance, which makes it one of the most critical parts of an ML pipeline. Several automated feature engineering frameworks have emerged over the last few years to simplify the process.

Expand-reduce

Introduced by Kanter et al., this method works by applying all k transformation functions to all m features at once, producing m × k candidate features. This is then followed by a feature selection step and hyperparameter tuning. Everything is done in one modeling step (feature selection excluded), but an obvious drawback is a “performance bottleneck in the feature selection step, given the large number of features to consider”. Since its original development, several variations of this technique have appeared (see ExploreKit, the One Button Machine (OBM), and most recently AutoLearn).
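Below is a minimal expand-reduce sketch (not Kanter et al.’s actual Deep Feature Synthesis implementation): every numeric feature is expanded with a handful of transformation functions, then a univariate feature selection step reduces the candidate set before the model is fit. The dataset, the transformation set, and the k=30 cutoff are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Expand: m original features x k transformation functions
transforms = [np.log1p, np.sqrt, np.square]
expanded = np.hstack([X] + [t(X) for t in transforms])

# Reduce: keep the 30 most informative columns, then fit the model
pipeline = make_pipeline(SelectKBest(f_classif, k=30),
                         StandardScaler(),
                         LogisticRegression(max_iter=1000))
print(cross_val_score(pipeline, expanded, y, cv=5).mean())
```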

Genetic programming

“Genetic programming is a form of artificial intelligence that mimics natural selection in order to find an optimal result. Genetic programming is iterative, and at each new stage of the algorithm, it chooses only the fittest of the “offspring” to cross and reproduce in the next generation, which is sometimes referred to as a fitness function. Just like in biological evolution, evolutionary algorithms can sometimes have randomly mutating offspring, but since only the offspring that have the highest fitness measure are reproduced, the fitness will almost always improve over generations” (source: https://www.virtusa.com/digital-themes/genetic-programming). “Tran et al. proposed genetic programming using a tree-based representation that can be used for both feature construction and implicit feature selection. While this method did provide favorable results in certain experiments, it also provides an unstable solution in which overfitting frequently occurs”.
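As a rough illustration of the selection / crossover / mutation loop described above, here is a toy genetic algorithm that evolves binary feature masks (which columns to keep), with cross-validated accuracy as the fitness function. Note this is a plain genetic algorithm on a toy problem, not Tran et al.’s tree-based genetic programming, which evolves feature-construction programs.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n_features, pop_size, n_generations = X.shape[1], 20, 10

def fitness(mask):
    if not mask.any():
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask], y, cv=3).mean()

population = rng.integers(0, 2, size=(pop_size, n_features)).astype(bool)
for _ in range(n_generations):
    scores = np.array([fitness(ind) for ind in population])
    parents = population[np.argsort(scores)[-pop_size // 2:]]   # keep the fittest half
    children = []
    while len(children) < pop_size:
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_features)                        # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_features) < 0.05                     # random mutation
        children.append(np.where(flip, ~child, child))
    population = np.array(children)

best = max(population, key=fitness)
print("selected features:", np.flatnonzero(best))
```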

Hierarchical greedy search

“Khurana et al. proposed the “Cognito” system which explores various feature construction choices in a hierarchical manner, while progressively maximizing the accuracy of the model through a greedy search strategy. This is done by constructing a directed acyclic transformation graph and applying transformations to all valid input features”. In other words, at each step the dataset is transformed with a candidate function (for example, sqrt), the resulting accuracy is compared against that of the other candidate transformations, and the best-performing one is kept before moving on.
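A minimal greedy sketch of that idea (hypothetical, not the actual Cognito system): at each step, try a few column-wise transformations, keep the one that most improves cross-validated accuracy, and stop when nothing helps. The dataset, the transformation set, and the three-step budget are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
transforms = {"sqrt": np.sqrt, "log1p": np.log1p, "square": np.square}

def score(data):
    model = LogisticRegression(max_iter=2000)
    return cross_val_score(model, data, y, cv=3).mean()

best_X, best_score = X, score(X)
for _ in range(3):                               # at most three greedy expansion steps
    candidates = {name: np.hstack([best_X, fn(np.abs(best_X))])
                  for name, fn in transforms.items()}
    name, cand = max(candidates.items(), key=lambda kv: score(kv[1]))
    cand_score = score(cand)
    if cand_score <= best_score:                 # no transformation helps any more
        break
    best_X, best_score = cand, cand_score
    print(f"kept {name}: accuracy {best_score:.4f}")
```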

Other techniques: Meta learning, Reinforcement learning

The same authors developed other techniques, one using reinforcement learning and one called LFE (Learning Feature Engineering), which uses meta-learning. “LFE works by learning how effective a given transformation (e.g., arithmetic or aggregate operators) on numerical features truly is by learning from past feature engineering experiences. Given a new dataset, LFE recommends a set of useful transformations to be applied without model evaluation or an explicit feature expansion/selection step. This approach uses a substantially lower amount of computational resources and was shown to improve upon previous feature engineering approaches on a majority of the datasets it was tested on” (see https://www.ijcai.org/proceedings/2017/0352.pdf for the academic paper).

Hyperparameter optimization

The most basic task of AutoML is to automatically set hyperparameters (manually set prior to training) to optimize model performance. “Hyperparameter optimization is often considered an “art”, requiring practitioner’s experience, general rules of thumb, and sometimes just a brute-force search”. Classic techniques are grid search, where we make no assumptions about the search space and take a brute-force approach, and random search, where we sample random hyperparameter configurations.
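Both classic techniques are available in scikit-learn; the example below tunes an SVM’s C and gamma with each of them. The parameter ranges are arbitrary illustration choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Grid search: brute-force every combination in a fixed grid
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]}, cv=5)
grid.fit(X, y)
print("grid search best:", grid.best_params_, grid.best_score_)

# Random search: sample 20 configurations from continuous distributions
rand = RandomizedSearchCV(SVC(),
                          {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e-1)},
                          n_iter=20, cv=5, random_state=0)
rand.fit(X, y)
print("random search best:", rand.best_params_, rand.best_score_)
```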

Particle Swarm Optimization (PSO) and evolutionary algorithms

PSO belongs to the “optimization from samples” family of techniques and consists of guided searches that iteratively generate new configurations based on the performance of prior configurations. “PSO works by updating the configuration space at each iteration by moving the solution towards the best individual configurations and searching the neighboring configurations in later iterations”.
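A bare-bones PSO loop in NumPy, minimizing a toy quadratic that stands in for a model’s validation loss; the inertia and attraction weights are common default-style values, not tuned.

```python
import numpy as np

rng = np.random.default_rng(0)
objective = lambda x: np.sum((x - 3.0) ** 2, axis=1)    # stand-in for validation loss

n_particles, dim, n_iters = 20, 2, 50
pos = rng.uniform(-10, 10, size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), objective(pos)            # per-particle best configurations
gbest = pbest[np.argmin(pbest_val)]                      # global best configuration

w, c1, c2 = 0.7, 1.5, 1.5                                # inertia and attraction weights
for _ in range(n_iters):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    # move each particle towards its own best and the global best configuration
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    val = objective(pos)
    better = val < pbest_val
    pbest[better], pbest_val[better] = pos[better], val[better]
    gbest = pbest[np.argmin(pbest_val)]

print("best configuration found:", gbest)
```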

Evolutionary algorithms are inspired by biological evolution and “work by maintaining a population (configuration space) and improve the population by applying mutations (small perturbations) and crossover (combining individual solutions) to obtain a “generation” of better configurations. One of the best implementations of these population-based methods is the covariance matrix adaptation evolution strategy (CMA-ES), which samples configurations from a multivariate Gaussian distribution whose mean and covariance are updated in each generation”.
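A short ask/tell loop for CMA-ES, assuming the pycma package (`pip install cma`); the two-dimensional quadratic objective is a stand-in for a model’s validation error over two hyperparameters.

```python
import cma

def objective(x):
    # stand-in for a validation-error measurement of a hyperparameter configuration
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2

es = cma.CMAEvolutionStrategy(x0=[0.0, 0.0], sigma0=1.0)
while not es.stop():
    candidates = es.ask()                                    # sample a generation from the Gaussian
    es.tell(candidates, [objective(c) for c in candidates])  # update mean and covariance
print("best configuration:", es.result.xbest)
```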

Bayesian optimization framework

Considered the state-of-the-art optimization framework for AutoML systems, “Bayesian optimization is a probabilistic, iterative algorithm with two main components: a surrogate model and an acquisition function. Bayesian optimization builds a probabilistic surrogate model, usually in the form of a Gaussian process or a tree-based model, which is used to map the different hyperparameter configurations to their performance with some measure of uncertainty. Using this surrogate model, an acquisition function is then defined to determine the potential utility of a given configuration, and therefore balance exploration and exploitation during the search process” (implementations include Hyperopt, SMAC, Spearmint, Hyperas, and Talos).
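Here is a short example with Hyperopt, one of the implementations listed above: its TPE algorithm builds a tree-structured Parzen surrogate and proposes the next configuration via `tpe.suggest`. The SVM search space and 30-evaluation budget are arbitrary.

```python
import numpy as np
from hyperopt import fmin, hp, tpe
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def loss(params):
    model = SVC(C=params["C"], gamma=params["gamma"])
    return 1.0 - cross_val_score(model, X, y, cv=3).mean()   # minimize 1 - accuracy

space = {
    "C": hp.loguniform("C", np.log(1e-2), np.log(1e2)),
    "gamma": hp.loguniform("gamma", np.log(1e-4), np.log(1e-1)),
}
best = fmin(fn=loss, space=space, algo=tpe.suggest, max_evals=30)
print("best hyperparameters:", best)
```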

Pipeline optimizers

When we think of AutoML, tools like H2O, DataRobot, or Google AutoML come to mind. What sets these tools apart is that they automate several tasks at once, so their scope is broader than that of the individual techniques covered above. Pipeline optimizers are systems that perform several tasks to automate a larger part of the ML process.

Auto-WEKA

“Auto-WEKA was the first AutoML system that considered the problem of simultaneously selecting a machine learning algorithm and optimizing its hyperparameters”. It uses Bayesian optimization methods to obtain high-quality results in a reasonable amount of time.

Auto-sklearn

Based on the Python library scikit-learn, auto-sklearn “uses a Bayesian Optimization search procedure to efficiently discover a top-performing model pipeline for a given dataset” (see https://machinelearningmastery.com/auto-sklearn-for-automated-machine-learning-in-python/ for a tutorial). The interesting part is that it searches OpenML repositories for similar datasets (number of data points, data skewness, etc.) so that past performance on them can be used to boost the efficiency of the model creation process.
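A minimal auto-sklearn run, assuming the `auto-sklearn` package is installed (it targets Linux); the tiny time budgets are for illustration only.

```python
import autosklearn.classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,   # total search budget in seconds
    per_run_time_limit=30,         # budget per candidate pipeline
)
automl.fit(X_train, y_train)
print(automl.leaderboard())
print("test accuracy:", automl.score(X_test, y_test))
```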

It is worth noting that, according to the authors, both Auto-WEKA and auto-sklearn are not “well equipped to handle large datasets”.

TPOT

“Tree-based Pipeline Optimization Tool (TPOT) is an open source genetic programming-based AutoML system that is meant to handle feature preprocessing, model selection, and hyperparameter optimization tasks for a given machine learning problem”. TPOT is a wrapper around scikit-learn and the algorithms available in that library. “TPOT uses a tree-based structure to represent a model pipeline for a predictive modeling problem, including data preparation and modeling algorithms and model hyperparameters. […] An optimization procedure is then performed to find a tree structure that performs best for a given dataset. Specifically, a genetic programming algorithm, designed to perform a stochastic global optimization on programs represented as trees.” (see https://machinelearningmastery.com/tpot-for-automated-machine-learning-in-python/).
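A short TPOT run, assuming the classic `tpot` package; generations and population size are kept deliberately small so the genetic search finishes quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

tpot = TPOTClassifier(generations=5, population_size=20, cv=5,
                      random_state=1, verbosity=2)
tpot.fit(X_train, y_train)
print("test accuracy:", tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")    # writes the winning pipeline as a Python script
```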

Neural architecture search (NAS)

The objective here is to find the neural network configuration. “NAS methods are best categorized by three factors: search space, search strategy, and performance estimation strategy. The search space refers to the potential neural architectures that can be represented by the NAS algorithm, whereas the search strategy refers to how this space is explored. Lastly, the performance estimation strategy refers to how the NAS algorithm evaluates a given architecture’s performance on some task given some training dataset”.

Search space

“The choice of search space significantly determines how difficult the optimization problem of NAS becomes, as optimization in this case is often non-continuous and usually high dimensional, given that more complex models tend to perform better. The simplest of search spaces is that of simple feed-forward neural networks” (i.e., an input layer, some hidden layers, and an output layer). More recent NAS work incorporates modern design elements such as skip connections, or searches for repeatable blocks (cells) rather than a whole architecture.

Search strategy

Recently, NAS has often been framed as a reinforcement learning (RL) problem. With this framework, “the generation of a neural architecture is considered the agent’s action and the action space is the search space. The agent’s reward mechanism is then determined by the performance estimation strategy of the given NAS algorithm”. Another popular strategy uses evolutionary algorithms. Moreover, “Auto-Keras […] uses Bayesian optimization and network morphisms, a technique to alter the architecture of a network but keep its functionality, to speed up their search strategy”.
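For a sense of what this looks like in practice, here is a small Auto-Keras search, assuming the `autokeras` package; `StructuredDataClassifier` handles tabular input and `max_trials` caps the number of candidate architectures evaluated. The dataset and budgets are arbitrary.

```python
import autokeras as ak
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = ak.StructuredDataClassifier(max_trials=3, overwrite=True)  # try up to 3 architectures
clf.fit(X_train, y_train, epochs=10)
print("test loss and accuracy:", clf.evaluate(X_test, y_test))
```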

Performance estimation strategy

“In order to guide these previously mentioned search strategies, a NAS algorithm needs a way to measure the performance of a given architecture that is being considered. While it would be very simple to train a given architecture on training data and evaluate its performance on a validation set, this is computationally expensive”.

One way to estimate performance while reducing training time is to use a subset of the full training data, “but these techniques also introduce bias into the performance estimates. This bias isn’t necessarily problematic as long as the relative rankings stay stable, but recent work suggests that this may not always be the case”. Another way is to use “learning curve exploitation, which attempts to predict from the initial learning curves those architectures that will have a poor performance, and terminate those architectures to speed up the search”.
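A rough sketch of the first idea (low-fidelity evaluation on a data subset), using scikit-learn MLPs as stand-ins for candidate architectures: rank the candidates cheaply on 20% of the data, then spend the full training budget only on the winner. The caveat from the text applies: rankings on the subset may not match rankings on the full data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
candidates = [(32,), (64, 32), (128, 64, 32)]        # candidate hidden-layer layouts

# Cheap, biased estimate: 3-fold accuracy on a 20% subsample
X_small, _, y_small, _ = train_test_split(X, y, train_size=0.2, random_state=0)
cheap_scores = [cross_val_score(MLPClassifier(hidden_layer_sizes=h, max_iter=300,
                                              random_state=0),
                                X_small, y_small, cv=3).mean()
                for h in candidates]
best = candidates[int(np.argmax(cheap_scores))]
print("architecture chosen on the subset:", best)

# Only the winner gets the expensive full-data evaluation
full = cross_val_score(MLPClassifier(hidden_layer_sizes=best, max_iter=1000,
                                     random_state=0), X, y, cv=5).mean()
print("full-data accuracy:", full)
```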
