
Using H2O AutoML to simplify training process (and also predict wine quality)

Aug 04, 2020 · 11 mins read

Training a machine learning model can take a lot of time and effort. Even once you have prepared your data and set up the whole environment, finding the best model for your problem is still not easy. So why not let tools do that for you?

How long does it take to find “the one”?

When you start working on a new task (let’s say fraud detection), collecting and processing data can be time-consuming: analyzing images or features, the correlations between them, and their distributions; filling missing values; removing irrelevant features or samples; maybe adding new samples using data augmentation techniques.

But is that the most tedious part of the work? How about training models and fine-tuning hyperparameters? You might want to start with a single, simple model like a decision tree to get a baseline score, but eventually you will want to compare tens of models and approaches to pick “the one” at last.

You can use different types of models - trees, regression algorithms or neural networks. Each of them has many parameters that can be configured - maximum tree depth, number of layers, learning rate and many others. And don’t forget that different approaches may also require differently prepared input (e.g. features encoded as one-hot vectors).

How much time would you spend manually configuring all these models in your script, adding a grid search, testing different metrics, cross-validation, ensembles and all that stuff? And why would you do all of that manually in the first place?
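To get a feeling for why this gets tedious fast, here is a small sketch of how quickly a manual grid search blows up. The grid below is hypothetical (the parameter names and values are just an example, not taken from the post):

```python
from itertools import product

# A hypothetical hyperparameter grid for a gradient boosting model.
grid = {
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [100, 300, 500],
}

# Every combination of values is one configuration to train and evaluate.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(configs))      # 18 configurations
print(len(configs) * 5)  # 90 training runs with 5-fold cross-validation
```

Three parameters with a handful of values each already mean dozens of training runs - and that is for a single model family, before you even compare trees against neural networks.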


Figure 1. A part of machine learning project pipeline

What is “automatic machine learning”?

Automatic machine learning (or AutoML) simplifies the whole machine learning pipeline for you. These tools can take care of both data and models in your pipeline. AutoML seems to be quite a thing lately and different tools obviously offer different functionality, but it usually includes:

  • Data processing (feature encoding, normalization etc.),
  • Feature engineering (removing or adding new relevant features, filling missing values),
  • Hyperparameter optimization (searching over a parameter grid to find the best performing configuration),
  • Training and model selection (models are trained and then compared on a leaderboard - using metrics),
  • Loading/saving models to a format that can be then deployed and used in a production code.
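The first two bullets cover the unglamorous work an AutoML tool automates for you. As a minimal illustration (my own toy sketch, not how any particular tool implements it), here is mean imputation followed by min-max normalization on a single feature column:

```python
# A toy feature column with one missing value (None).
values = [7.0, 6.3, None, 8.1, 5.6]

# Fill missing entries with the mean of the observed values.
observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)
filled = [v if v is not None else mean for v in values]

# Min-max normalization: scale the column into the [0, 1] range.
lo, hi = min(filled), max(filled)
normalized = [(v - lo) / (hi - lo) for v in filled]

print(normalized)  # largest value maps to 1.0, smallest to 0.0
```

Now imagine repeating this (plus encoding, outlier handling, etc.) for every column of every dataset you touch - exactly the kind of repetitive step worth automating.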

To me, it already sounds like a lot of saved time. That’s why in this post I describe how I used AutoML from H2O to see how it works, how much time I can save and (most importantly) whether AutoML can really find a good model for me.

Wine Quality Dataset

I picked a relatively simple dataset for this post. The goal is to model wine quality based on data obtained from physicochemical tests. The Wine Quality Data Set is now more than 10 years old, but it is still a great resource if you need a simple binary classification problem. It has 11 numerical features (see the table below) and 4898 samples to work with.


Figure 2. Sample rows from Wine Quality Dataset

In this task, I read and use raw data - no processing or feature engineering. I wrote only two lines to read and prepare my data for training with H2O. Interestingly, H2O has its own implementation of DataFrame (with a Pandas-like interface), which means you don’t need to import another library just to read a CSV file 👏

df = h2o.import_file('data/wine.csv') # Read CSV to H2ODataFrame object
df['quality'] = df['quality'].asfactor() # Force H2O to treat this as a classification problem

# Split data frame into training and test subsets
train, test = df.split_frame(ratios=[0.9], seed=7)

The asfactor() method tells H2O that you want to use your data for classification. You don’t have to process the target column (one-hot encoding is not required); you just cast the column to the “factor” type and that’s it.

Another +1 for H2ODataFrame supporting a train-test split without any external libraries. Here I split the data frame into training (90% = 1447 rows) and test (10% = 152 rows) subsets. And that’s everything you need to start training and validating your models with AutoML, which does the rest under the hood.
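Note that the split is approximate: H2O’s split_frame assigns rows probabilistically rather than cutting at an exact 90% boundary, which is why the resulting counts are not an exact 90/10 split. A rough pure-Python sketch of the idea (my own illustration, not H2O’s actual implementation):

```python
import random

def split_frame(rows, ratio=0.9, seed=7):
    # Each row independently lands in the training set with probability
    # `ratio` - mimicking the approximate split H2O's split_frame performs.
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < ratio else test).append(row)
    return train, test

train_rows, test_rows = split_frame(list(range(1599)), ratio=0.9, seed=7)
print(len(train_rows), len(test_rows))  # roughly 90/10, not exactly
```

The upside of this per-row scheme is that it works in one streaming pass over arbitrarily large frames; the downside is the slight wobble in subset sizes.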

Training multiple models with H2O AutoML

If loading and processing data in two lines of code was shocking, pay attention now. It turns out that you need just three lines of code to run AutoML and train a bunch of models at once. The H2O documentation specifies which models will be trained and cross-validated in the current version of AutoML:

three pre-specified XGBoost GBM (Gradient Boosting Machine) models, a fixed grid of GLMs, a default Random Forest (DRF), five pre-specified H2O GBMs, a near-default Deep Neural Net, an Extremely Randomized Forest (XRT), a random grid of XGBoost GBMs, a random grid of H2O GBMs, and a random grid of Deep Neural Nets.

Moreover, it adds two stacked ensembles (combining multiple models). One of them contains all the models and the second includes the best performing model from each algorithm class/family. That’s a good thing, because stacked ensembles usually perform better than single models.
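Under the hood, H2O’s stacked ensembles train a metalearner on the base models’ cross-validated predictions. As a minimal sketch of the combining step (with made-up probabilities and fixed weights standing in for the learned metalearner):

```python
# Hypothetical class-1 probabilities from three base models for four samples.
base_preds = [
    [0.9, 0.2, 0.6, 0.8],  # e.g. a GBM
    [0.8, 0.3, 0.5, 0.9],  # e.g. an XRT
    [0.7, 0.1, 0.7, 0.7],  # e.g. a deep net
]

# A real stacked ensemble learns this combination from CV predictions;
# fixed weights are just a stand-in here.
weights = [0.5, 0.3, 0.2]

ensemble = [
    sum(w * preds[i] for w, preds in zip(weights, base_preds))
    for i in range(len(base_preds[0]))
]
print(ensemble)
```

The intuition: errors of diverse base models partially cancel out, so the weighted blend tends to be more reliable than any single member - which is why the "best of family" ensemble tops the leaderboard later in this post.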

All of this happens within only three lines of code:

from h2o.automl import H2OAutoML

# "nfolds" is a number of folds for cross-validation,
# "max_runtime_secs" sets a time limit for a whole training process,
# "sort_metric" determines which metric will be used to compare models.
aml = H2OAutoML(nfolds=5, seed=7, max_runtime_secs=600, sort_metric='logloss')

aml.train(y='quality', training_frame=train)

Here comes a third (kind) surprise. Unlike other popular frameworks (at least in Python), you don’t need to split your data into features x and labels y. You use the same data frame and only specify which column should be used as the target.


If you want to add more configuration to H2OAutoML, it has other interesting parameters, e.g.:

  • x if you want to use only a subset of feature columns for training,
  • validation_frame if you have your validation subset (and you don’t want to run CV),
  • balance_classes to oversample the minority classes (to balance class distribution),
  • stopping_metric to use for early stopping (default is logloss for classification and deviance for regr.),
  • include/exclude_algos if you want to modify the list of models to be built by AutoML.
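Putting a few of those parameters together might look like this (illustrative only - it assumes a running H2O cluster, the `train` frame from earlier, and a hypothetical `feature_columns` list):

```python
from h2o.automl import H2OAutoML

aml = H2OAutoML(
    nfolds=5,
    seed=7,
    max_runtime_secs=600,
    sort_metric='logloss',
    balance_classes=True,            # oversample minority classes
    stopping_metric='logloss',       # metric used for early stopping
    exclude_algos=['DeepLearning'],  # e.g. skip neural nets entirely
)
# `feature_columns` is a hypothetical list of column names to train on.
aml.train(x=feature_columns, y='quality', training_frame=train)
```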

To sum up, I wrote six lines of code so far and waited for 600 seconds. Now let’s see what the output is and what models AutoML prepared for me. Results come in the form of a leaderboard where you can compare models sorted by a metric of your choice (sort_metric).


After 10 minutes, when I had my models ready, I couldn’t wait to check leaderboard output:

h2o.automl.get_leaderboard(aml, ['training_time_ms'])

The leaderboard returned a total of 29 (in my case) models sorted according to sort_metric=logloss. Each row consists of a model id and metric values (I couldn’t find a way to inject custom metrics though) and an additional column called training_time_ms that can be added optionally.


Figure 3. H2O AutoML Leaderboard (9 out of 29 rows)

The best returned model is a StackedEnsemble_BestOfFamily (containing the best models from each family - XRT, GBM etc.). Another StackedEnsemble is in third place, which makes sense - maybe including weak models decreased the ensemble’s certainty in its predictions. Surprisingly, 2nd place belongs to a single XRT model.

However, if you look closer at the values of logloss and AUC, all three top models are pretty close. Actually, the stacked ensemble of all models has a slightly better AUC score than the 2nd place XRT, but it was less certain of its predictions (higher log loss) and finished 3rd.

Either way, the table seems legit and the top models have good scores - but it would be good to compare them with something else. What if we just take a simple model (one of these), train it without any AutoML magic, and it turns out to be much better?
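To see why a model can win on AUC yet lose on log loss, recall that log loss measures the confidence of predictions, not just their ranking. A small sketch (with made-up probabilities) of two models that classify the same samples correctly but with different certainty:

```python
import math

def logloss(y_true, y_prob):
    # Binary cross-entropy, averaged over samples.
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    ) / len(y_true)

y = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.9, 0.1]  # same ranking, high certainty
hesitant  = [0.6, 0.4, 0.6, 0.4]  # same ranking, low certainty

print(logloss(y, confident))  # ~0.105
print(logloss(y, hesitant))   # ~0.511
```

Both prediction sets rank every positive above every negative (identical AUC), but the hesitant model pays a much higher log loss - the same effect that pushed the all-models ensemble down to 3rd place.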

Comparing models

To see whether the results from the leaderboard are actually good, I’ll train a single model from the H2O library, e.g. H2OGradientBoostingEstimator (not using AutoML), with its default config to compare it with the models returned by the AutoML tool. Note that I use the same seed, number of folds and training set to be fair.

from h2o.estimators import H2OGradientBoostingEstimator

gbm_model = H2OGradientBoostingEstimator(nfolds=5, seed=7, keep_cross_validation_predictions=True)
gbm_model.train(y='quality', training_frame=train)

# xval corresponds to AUC column in leaderboard
print(f'loss = {gbm_model.logloss(xval=True)}, auc = {gbm_model.auc(xval=True)}')

loss = 0.4807083065456677, auc = 0.8500663129973475

Quite good results for a default configuration and three lines of code. In the leaderboard such a score would end up at 15th place - more or less in the middle. You can see it compared to other models in the figure below. The green-ish line shows the result of H2OGradientBoostingEstimator, whereas the orange line shows the average score of all models in the leaderboard.


Figure 4. Loss and AUC metric for H2O models

Using AutoML to find the best performing approach requires a similar amount of code as training a single model, and in the end we get much better results with AutoML. I believe it’s totally worth it then, unless it doesn’t perform well on the separate test set I created earlier…

gbm_test = gbm_model.model_performance(test)
aml_test = aml.leader.model_performance(test)

print(f'Single-model = {gbm_test.auc()}')
print(f'AutoML leader = {aml_test.auc()}')

Single-model = 0.8779220779220779
AutoML leader = 0.908251082251083

…but it does.


I’m not sure if AutoML tools are something you would decide to use in production code (maybe you still want to fine-tune the models a bit, or you want to use models not implemented in H2O or another AutoML tool), but at least they can give you some insights, e.g. which of these algorithms perform best on your data.

Automatic machine learning can also be a perfect choice for non-experts who just want to use some model in their existing products. If you’re lucky and such a tool can find a model that is good enough for you, you don’t need to learn how to use more complex ML frameworks.

As for H2O as a library, I like that it differs from other tools in a few aspects (built-in train/test split, a single frame for features and labels) and I believe there are more kind surprises waiting. I probably wouldn’t use it for bigger projects that involve designing custom models, but it may be awesome for simple projects (like the one I described).

It also has decent documentation, although there are a few things I couldn’t find there, so I skipped them in my project. Nevertheless, similarly to PyTorch Lightning, I believe that the community is still growing and the best is yet to come 😊