TPS-Mar21, Leaderboard %14, XGB, CatBoost, LGBM + Optuna 🚀

Part 2, Xgboost, CatBoost, and Lightgbm with Optuna…

Design vector created by freepik — www.freepik.com

Modeling is one of the most important parts of predictions. You should find the best machine learning model for better results. In part 1, we worked on EDA and feature engineering. You can see this article here.

In this part of the article, we compared three ml models that are Xgboost, Catboost, and LGBM. The competition metric is Area Under the Receiver Operating Characteristic Curve (ROC AUC).

You can see the dataset here and you can see full python code at the end of the article.

Introduction

Firstly we imported the libraries and then calculated the baseline scores of these machine learning models.

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

import optuna
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_param_importances

# XGBClassifier
xgbc_model = XGBClassifier(min_child_weight=0.1, reg_lambda=100, booster='gbtree', objective='binary:logitraw', random_state=42)
xgbc_score = cross_val_score(xgbc_model, train_X, train_y, scoring='roc_auc', cv=5)
print('xgbc_score: ', xgbc_score.mean())

# LGBMClassifier
ligthgbmc_model = LGBMClassifier(boosting_type='gbdt', objective='binary', random_state=42)
ligthgbmc_score = cross_val_score(ligthgbmc_model, train_X, train_y, scoring='roc_auc', cv=5)
print('ligthgbmc_score: ', ligthgbmc_score.mean())

# CatBoostClassifier
cbc_model = CatBoostClassifier(loss_function='Logloss', random_state=42, verbose=False)
cbc_score = cross_val_score(cbc_model, train_X, train_y, scoring='roc_auc', cv=5)
print('cbc_score: ', cbc_score.mean())

####################################################################

Outputs:
xgbc_score:  0.8898202612356174
ligthgbmc_score:  0.8879385374274603
cbc_score:  0.8909648517647316

Xgboost + Optuna

According to baseline scores, the best model is catboost but it can be changed after hyperparameter tuning. You can see XGB usage with Optuna below.

def objective(trial, data=X, target=y):
    X_train, X_val, y_train, y_val = train_test_split(data, target, test_size=0.2, random_state=42)

    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 32),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.005, 0.02, 0.05, 0.08, 0.1]),
        'n_estimators': trial.suggest_int('n_estimators', 2000, 8000),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
        'gamma': trial.suggest_float('gamma', 0.0001, 1.0, log = True),
        'alpha': trial.suggest_float('alpha', 0.0001, 10.0, log = True),
        'lambda': trial.suggest_float('lambda', 0.0001, 10.0, log = True),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 0.8),
        'subsample': trial.suggest_float('subsample', 0.1, 0.8),
        'tree_method': 'gpu_hist',
        'booster': 'gbtree',
        'random_state': 42,
        'use_label_encoder': False,
        'eval_metric': 'auc'

    }
    
    model = XGBClassifier(**params)  
    model.fit(X_train, y_train, eval_set = [(X_val,y_val)], early_stopping_rounds = 333, verbose = False)
    y_pred = model.predict_proba(X_val)[:,1]
    roc_auc = roc_auc_score(y_val, y_pred)

    return roc_auc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print('Best value: ', study.best_value)

####################################################################

Outputs:
Best value: 0.8951492161710065

CatBoost + Optuna

def objective(trial, data=X, target=y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 64),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.005, 0.02, 0.05, 0.08, 0.1]),
        'n_estimators': trial.suggest_int('n_estimators', 2000, 8000),
        'max_bin': trial.suggest_int('max_bin', 200, 400),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 300),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 0.0001, 1.0, log = True),
        'subsample': trial.suggest_float('subsample', 0.1, 0.8),
        'random_seed': 42,
        'task_type': 'GPU',
        'loss_function': 'Logloss',
        'eval_metric': 'AUC',
        'bootstrap_type': 'Poisson'
    }
    
    model = CatBoostClassifier(**params)  
    model.fit(X_train, y_train, eval_set = [(X_val,y_val)], early_stopping_rounds = 222, verbose = False)
    y_pred = model.predict_proba(X_val)[:,1]
    roc_auc = roc_auc_score(y_val, y_pred)

    return roc_auc

study = optuna.create_study(direction = 'maximize')
study.optimize(objective, n_trials = 50)
print('Best value:', study.best_value)

####################################################################

Outputs:
Best value: 0.8925910141177894

LGBM + Optuna

After hyperparameter optimization, we can see that LGBM is the best model now.

def objective(trial,data=X,target=y):   
    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.15,random_state=42)
    params = {
        'reg_alpha': trial.suggest_float('reg_alpha', 0.001, 10.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.001, 10.0),
        'num_leaves': trial.suggest_int('num_leaves', 11, 333),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'max_depth': trial.suggest_int('max_depth', 5, 64),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.01, 0.02, 0.05, 0.005, 0.1]),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 0.5),
        'n_estimators': trial.suggest_int('n_estimators', 2000, 8000),
        'cat_smooth' : trial.suggest_int('cat_smooth', 10, 100),
        'cat_l2': trial.suggest_int('cat_l2', 1, 20),
        'min_data_per_group': trial.suggest_int('min_data_per_group', 50, 200),
        'cat_feature' : [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 
                         32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 
                         53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67],
        'n_jobs' : -1, 
        'random_state': 42,
        'boosting_type': 'gbdt',
        'metric': 'AUC',
        'device': 'gpu'
    }

    model = LGBMClassifier(**params)  
    model.fit(train_x,train_y,eval_set=[(test_x,test_y)],eval_metric='auc', early_stopping_rounds=300, verbose=False)
    preds = model.predict_proba(test_x)[:,1]
    auc = roc_auc_score(test_y, preds)
    
    return auc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

####################################################################

Outputs:
Best value:  0.8966645758299353

Visualizations

Optimization History

# Historic
plot_optimization_history(study)

Hyperparameter Importances

# Importance
optuna.visualization.plot_param_importances(study)

Conclusion

This is part 2 of the TPS-Mar21 competition that I am in LB %14. In this article, we compared famous machine learning boosting models for better prediction. Due to the results, Lightgbm is the best model for this problem.

According to the problem, the best boosting model can change. Also, sometimes speed can be more important than success. You can find more detailed information in this article about when to choose which boosting model.

You can see full python code and all plots from here 👉 Kaggle Notebook.

👋 Thanks for reading. If you enjoy my work, don’t forget to like it 👏, follow me on Medium and LinkedIn. It will motivate me in offering more content to the Medium community! 😊

References:

[1]: https://www.kaggle.com/hasanbasriakcay/xgb-catboost-lgbm-optuna-lb-14/notebook [2]: https://www.kaggle.com/c/tabular-playground-series-mar-2021/data [3]: https://optuna.readthedocs.io/en/stable/reference/study.html [4]: https://xgboost.readthedocs.io/en/stable/ [5]: https://catboost.ai/en/docs/ [6]: https://lightgbm.readthedocs.io/en/latest/ [7]: https://neptune.ai/blog/when-to-choose-catboost-over-xgboost-or-lightgbm

Hasan Basri Akçay - Data Engineer - İnelso Energy Systems | LinkedIn

View Hasan Basri Akçay's profile on LinkedIn, the world's largest professional community. Hasan Basri has 6 jobs listed…

www.linkedin.com

More…

What Are Baseline Models and Benchmarking For Machine Learning, Why We Need Them?

Random, Machine Learning, Automated ML Baseline Models and Benchmarking For ML…

pub.towardsai.net

The One Image Every Data Scientist Should Be Aware Of

Technical debt in machine learning is an important topic for every data scientist.

medium.com

5 Important Python Libraries and Methods For Data Scientists!

Most of the python libraries are already written for data science but newbies working in data science and machine…

medium.com

Welcome, 2022🎉. What Has Changed in Data Science in 2021?

Best Data Science Tools, Methods, and Techniques such as Cloud Computing Product, Automated ML Tools, Courses, IDEs…

medium.com

What Are The Differences Between Data Scientists That Earn 500💲 And 225.000💲 Yearly?

This article is about important talents, tools, features of the country, and features of the company for high income in…

medium.com