Kyosuke Morita

Summary

NGBoost, a new boosting algorithm for probabilistic prediction, is compared to LightGBM and XGBoost in terms of performance and features.

Abstract

The article discusses the new algorithm NGBoost, which uses Natural Gradient Boosting for probabilistic predictions. The algorithm consists of base learners, parametric probability distribution, and scoring rules. The performance of NGBoost is compared to LightGBM and XGBoost using the house price prediction dataset from Kaggle. The results show that NGBoost outperforms the other two algorithms in terms of RMSE. However, NGBoost has a longer computing time and lacks some features like early stopping and intermediate result display. The article concludes that NGBoost is a competitive algorithm for probabilistic prediction but needs further improvements.

Bullet points

  • NGBoost is a new boosting algorithm for probabilistic prediction.
  • NGBoost uses Natural Gradient Boosting, which consists of base learners, parametric probability distribution, and scoring rules.
  • The performance of NGBoost is compared to LightGBM and XGBoost using the house price prediction dataset from Kaggle.
  • NGBoost outperforms LightGBM and XGBoost in terms of RMSE.
  • NGBoost has a longer computing time and lacks some features like early stopping and intermediate result display.
  • NGBoost is a competitive algorithm for probabilistic prediction but needs further improvements.

Data Science in the Real World

NGBoost Explained

Comparison to LightGBM and XGBoost


Stanford ML Group recently published a new algorithm, NGBoost, in their paper (Duan et al., 2019 [1]) together with its implementation. The algorithm brings uncertainty estimation into gradient boosting by using the natural gradient. This post tries to understand the new algorithm and compares it with two other popular boosting algorithms, LightGBM and XGBoost, to see how it works in practice.

Content

  1. What is Natural Gradient Boosting anyways?
  2. Empirical validation — comparison to LightGBM and XGBoost
  3. Conclusion

1. What is Natural Gradient Boosting anyways?

As mentioned in the intro, NGBoost is a new boosting algorithm that uses Natural Gradient Boosting, a modular boosting algorithm for probabilistic prediction. The algorithm consists of three components: base learners, a parametric probability distribution, and a scoring rule. I will briefly explain each of these terms.

From the paper, Duan et al., 2019
  • Base learners

The algorithm uses base (weak) learners. Each one takes the inputs x, and its outputs are used to form the conditional probability distribution. The implementation uses scikit-learn’s decision tree as the tree learner and Ridge regression as the linear learner.

  • Parametric probability distribution

The parametric probability distribution is a conditional distribution of the target given the inputs. Its parameters are formed by an additive combination of the base learner outputs.

  • Scoring Rule

A scoring rule takes a predicted probability distribution and one observation of the target feature and assigns a score to the prediction, such that the true distribution of the outcomes gets the best score in expectation. The algorithm uses either MLE (Maximum Likelihood Estimation) or CRPS (Continuous Ranked Probability Score).
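To make the two scoring rules concrete, here is a minimal sketch for a Normal predictive distribution. It is not taken from the NGBoost code base: the function names and the example numbers are my own assumptions, and the formulas are the standard negative log-likelihood and the closed-form CRPS of a Normal distribution. In both cases a lower score means a better prediction.

```python
import math


def normal_log_score(mu, sigma, y):
    """Negative log-likelihood of y under N(mu, sigma^2); lower is better."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)


def normal_crps(mu, sigma, y):
    """Closed-form CRPS of N(mu, sigma^2) at observation y; lower is better."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)   # standard normal density
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))             # standard normal CDF
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))


# Hypothetical predictions for one observation (values are illustrative only)
y_obs = 12.0
candidates = {
    "sharp and close": (12.1, 0.2),
    "vague but close": (12.1, 2.0),
    "sharp but wrong": (13.0, 0.2),
}
for name, (mu, sigma) in candidates.items():
    print(name, round(normal_log_score(mu, sigma, y_obs), 3),
          round(normal_crps(mu, sigma, y_obs), 3))
```

Both rules rank the sharp and close prediction best and the sharp but wrong one worst, which is the behaviour we want from a score that rewards honest uncertainty.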

We have just gone through the basic concepts of NGBoost. I definitely recommend reading the original paper for further understanding (the algorithm is easier to follow with the math notation).
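For readers who prefer code to notation, the sketch below shows what the "natural" part of the gradient looks like for a Normal distribution scored with the negative log-likelihood. It assumes the distribution is parameterised by (mu, log sigma), and the function names are hypothetical. It illustrates the idea only and is not the package's implementation.

```python
import math


def log_score_gradient(mu, log_sigma, y):
    """Ordinary gradient of the Normal negative log-likelihood w.r.t. (mu, log_sigma)."""
    var = math.exp(2 * log_sigma)
    d_mu = -(y - mu) / var
    d_log_sigma = 1 - (y - mu) ** 2 / var
    return d_mu, d_log_sigma


def natural_gradient(mu, log_sigma, y):
    """Ordinary gradient multiplied by the inverse Fisher information.

    For the (mu, log_sigma) parameterisation the Fisher information is
    diag(1 / sigma^2, 2), so the inverse simply rescales each component.
    """
    var = math.exp(2 * log_sigma)
    d_mu, d_log_sigma = log_score_gradient(mu, log_sigma, y)
    return d_mu * var, d_log_sigma / 2


# Same residual (y - mu = 1) under a very confident and a very vague prediction
for sigma in (0.1, 10.0):
    print(sigma,
          log_score_gradient(0.0, math.log(sigma), 1.0),
          natural_gradient(0.0, math.log(sigma), 1.0))
```

The ordinary gradient for mu changes by several orders of magnitude between the two cases, while the natural gradient for mu is simply the negative residual in both. This invariance to the current scale is what makes the natural gradient attractive when the mean and the spread are learned at the same time.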

2. Empirical Validation — Comparison to LightGBM and XGBoost


Let’s implement NGBoost and see how it performs. The original paper also ran experiments on various datasets. The authors compared NGBoost with MC dropout and Deep Ensembles on regression problems, and NGBoost showed quite competitive performance. In this blog post, I would like to show the model performance on the famous house price prediction dataset on Kaggle. This dataset consists of 81 features and 1460 rows, and the target feature is the sale price. Let’s see whether NGBoost can handle these conditions.

Distribution of the target feature

As the purpose of this post is to test the performance of the algorithms, we will skip the whole feature engineering part and use Nanashi’s solution.

Import the packages:

# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ngboost.ngboost import NGBoost
from ngboost.learners import default_tree_learner
from ngboost.distns import Normal
from ngboost.scores import MLE
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

Here I will use the default learner, distribution, and scoring rule imported above. It would be interesting to play around with these and see how the results change.

# read the dataset
df = pd.read_csv('~/train.csv')
# feature engineering
tr, te = Nanashi_solution(df)

Now predict using the NGBoost algorithm.

# NGBoost
ngb = NGBoost(Base=default_tree_learner, Dist=Normal, Score=MLE(), natural_gradient=True, verbose=False)
# fit on the training features and target, passed as NumPy arrays
ngboost = ngb.fit(np.asarray(tr.drop(['SalePrice'], axis=1)), np.asarray(tr.SalePrice))
# point predictions for the test set
y_pred_ngb = pd.DataFrame(ngb.predict(np.asarray(te.drop(['SalePrice'], axis=1))))

Do the same with LightGBM and XGBoost.

# LightGBM
ltr = lgb.Dataset(tr.drop(['SalePrice'], axis=1), label=tr['SalePrice'])
param = {
'bagging_freq': 5,
'bagging_fraction': 0.6,
'bagging_seed': 123,
'boost_from_average':'false',
'boost': 'gbdt',
'feature_fraction': 0.3,
'learning_rate': .01,
'max_depth': 3,
'metric':'rmse',
'min_data_in_leaf': 128,
'min_sum_hessian_in_leaf': 8,
'num_leaves': 128,
'num_threads': 8,
'tree_learner': 'serial',
'objective': 'regression',
'verbosity': -1,
'random_state':123,
'max_bin': 8,
'early_stopping_round':100
}
lgbm = lgb.train(param,ltr,num_boost_round=10000,valid_sets=[(ltr)],verbose_eval=1000)
# keep the continuous predictions: this is a regression task, so no thresholding
y_pred_lgb = lgbm.predict(te.drop(['SalePrice'], axis=1))
# XGBoost
params = {'max_depth': 4, 'eta': 0.01, 'objective':'reg:squarederror', 'eval_metric':['rmse'],'booster':'gbtree', 'verbosity':0,'sample_type':'weighted','max_delta_step':4, 'subsample':.5, 'min_child_weight':100,'early_stopping_round':50}
dtr, dte = xgb.DMatrix(tr.drop(['SalePrice'], axis=1),label=tr.SalePrice), xgb.DMatrix(te.drop(['SalePrice'], axis=1),label=te.SalePrice)
num_round = 5000
xgbst = xgb.train(params,dtr,num_round,verbose_eval=500)
y_pred_xgb = xgbst.predict(dte)

Now we have predictions from all of the algorithms. Let’s check how accurate they are, using the same metric as the Kaggle competition, RMSE.

# Check the results on the held-out set (te)
print('RMSE: NGBoost', round(sqrt(mean_squared_error(te.SalePrice,y_pred_ngb)),4))
print('RMSE: LGBM', round(sqrt(mean_squared_error(te.SalePrice,y_pred_lgb)),4))
print('RMSE: XGBoost', round(sqrt(mean_squared_error(te.SalePrice,y_pred_xgb)),4))

Here is the summary of prediction results.

Summary of the results

It seems that NGBoost outperformed the other famous boosting algorithms. To be fair, I have the impression that tuning the parameters of NGBoost would make it even better.

One of NGBoost’s biggest differences from other boosting algorithms is that it can return a probability distribution for each prediction. The pred_dist function returns these predicted distributions, which can then be visualised.

# see the probability distributions by visualising
Y_dists = ngb.pred_dist(np.asarray(te.drop(['SalePrice'], axis=1)))
y_range = np.linspace(min(te.SalePrice), max(te.SalePrice), 200)
dist_values = Y_dists.pdf(y_range).transpose()
# plot one index at a time (the figures show index 0 and index 114)
idx = 114
plt.plot(y_range,dist_values[idx])
plt.title(f"idx: {idx}")
plt.tight_layout()
plt.show()

Probability distribution examples

The plots above are the probability distributions of individual predictions. The x-axis shows the log value of the sale price (the target feature). We can observe that the probability distribution is wider for index 0 than for index 114.
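A wider distribution translates directly into a wider prediction interval. The sketch below assumes a Normal predictive distribution and uses made-up mean and standard deviation values on the log-price scale (they are not taken from the experiment). The helper name central_interval is hypothetical, and only Python's standard library is used.

```python
from statistics import NormalDist


def central_interval(mean, std, coverage=0.95):
    """Central prediction interval of N(mean, std^2) with the given coverage."""
    dist = NormalDist(mu=mean, sigma=std)
    tail = (1 - coverage) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)


# Hypothetical predicted parameters for two houses (illustrative values only)
predictions = {"wide": (12.0, 0.20), "narrow": (12.0, 0.05)}
for label, (mean, std) in predictions.items():
    low, high = central_interval(mean, std)
    print(f"{label}: [{low:.3f}, {high:.3f}] width={high - low:.3f}")
```

Two predictions with the same mean can therefore carry very different levels of confidence, which a point prediction from LightGBM or XGBoost would not reveal.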

3. Conclusion and Thoughts

From the results of this experiment, we can conclude that NGBoost is as good as other famous boosting algorithms. However, its computing time is considerably longer than that of the other two algorithms. This could probably be improved by using a subsampling method. I also had the impression that the NGBoost package is still a work in progress. For example, there is no early stopping option, no option for showing intermediate results, little flexibility in choosing the base learner (so far we can only choose between a decision tree and Ridge regression), no way of setting a random state seed, and so on. I believe these points will be implemented very soon. Or you can contribute to the project :)
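As a reminder of what an early stopping option would do, here is a library-independent sketch. It assumes you have recorded a validation loss after each boosting iteration. The function name best_iteration and the loss values are hypothetical, and nothing here is part of the NGBoost package.

```python
def best_iteration(val_losses, patience):
    """Return the index of the best validation loss, scanning until the loss
    has failed to improve for `patience` consecutive iterations."""
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss   # new best: remember it
        elif i - best_idx >= patience:
            break                           # no improvement for `patience` rounds
    return best_idx


# Illustrative validation losses: improvement stalls after the third iteration
losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.0]
print(best_iteration(losses, patience=3))
print(best_iteration(losses, patience=10))
```

A small patience stops early and can miss a later improvement, whereas a large patience costs more computing time, so the value is a trade-off to tune.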

You can also find the code I used for this post on my GitHub page.

Wrap up

  • NGBoost is a new boosting algorithm that returns a probability distribution.
  • Natural Gradient Boosting is a modular boosting algorithm for probabilistic prediction. It consists of a base learner, a parametric probability distribution, and a scoring rule.
  • NGBoost predictions are quite competitive against other popular boosting algorithms.

If you found the story helpful, interesting or whatever, or also if you have any question, feedback or literally anything, feel free to leave a comment below :) I would really appreciate it. Also, you can find me on LinkedIn.

Reference:

[1] T. Duan, et al., NGBoost: Natural Gradient Boosting for Probabilistic Prediction (2019), arXiv:1910.03225
