Abish Pius

Summary

AutoGluon-Tabular is an open-source AutoML library that simplifies the process of building machine learning models for tabular data with Python, offering robustness, predictability, and fault tolerance.

Abstract

AutoGluon-Tabular extends the AutoGluon AutoML library's capabilities to handle tabular data through a Python SDK. It automates the entire machine learning pipeline, from data preprocessing to model training and ensembling, making it accessible for users with varying levels of expertise. The library is designed with simplicity, robustness, predictable timing, and fault tolerance in mind. It allows users to specify a time budget for training and can handle interruptions by resuming from checkpoints. AutoGluon-Tabular introduces a novel neural network architecture suitable for tabular data and automates the complex process of multi-layer stack ensembling to improve predictive accuracy.

Opinions

  • AutoGluon-Tabular is praised for its ease of use, enabling both novices and seasoned data scientists to train and deploy high-quality machine learning models efficiently.
  • The library is commended for automating time-consuming tasks such as handling missing data, feature transformations, and hyperparameter tuning, thus saving time and resources for experts.
  • The article suggests that deep learning, often considered less effective for tabular data, can achieve high predictive accuracy when using AutoGluon-Tabular's novel neural network architecture with embedding layers for categorical features.
  • The fault tolerance mechanisms, including model checkpoints and the ability to anticipate and handle training failures, are highlighted as significant advantages for real-world applications where training disruptions may occur.
  • The article implies that AutoGluon-Tabular's approach to ensembles and multi-layer stacking represents an advancement in AutoML, simplifying the process and delivering improved results.

Take Advantage of the Power of Amazon ML for FREE in Python with AutoGluon!

AWS AutoGluon

AutoGluon, a versatile open-source AutoML (Automated Machine Learning) library, extends its capabilities to tabular data through a free-to-use Python SDK. It automates the machine learning pipeline, making model building accessible, robust, and time-efficient for all.

Why AutoGluon-Tabular?

Key Principles

AutoGluon-Tabular has been crafted with specific principles in mind, ensuring its usability and effectiveness:

  1. Simplicity: Users can train and deploy classification and regression models with just a few lines of code.
  2. Robustness: Raw data can be provided without the need for extensive feature engineering or data manipulation.
  3. Predictable Timing: Users can specify a time budget, allowing AutoGluon-Tabular to deliver the best model within the given constraint.
  4. Fault Tolerance: The system allows for interrupted training, enabling users to inspect and resume intermediate steps.

Time-Saving for Experts

Even seasoned data scientists stand to benefit from AutoGluon-Tabular’s automation. It streamlines time-consuming manual tasks such as handling missing data, feature transformations, data splitting, model and algorithm selection, hyperparameter tuning, and ensembling. For experts, this means saving time and resources while still obtaining high-quality models.

AutoGluon API: Three Functions to Rule Them All

AutoGluon-Tabular users interact with just three Python functions: TabularDataset(), fit(), and predict(). Despite the apparent simplicity, these functions drive robust machinery behind the scenes. Let's unwrap them step by step.

Step 0: Setting Up

The first step is to set up a machine to train on, for example an Amazon EC2 instance, preferably with a multi-core CPU for faster training. Install AutoGluon, then import TabularDataset and TabularPredictor from autogluon.tabular (older releases exposed the same workflow through a TabularPrediction task object). The API remains consistent, allowing easy transitions between problem domains.

!pip install autogluon
from autogluon.tabular import TabularDataset, TabularPredictor

Step 1: Loading the Dataset

For users familiar with pandas, TabularDataset() provides a familiar interface for loading and manipulating data, since the object it returns behaves like a pandas DataFrame. AutoGluon-Tabular manages data preprocessing automatically, eliminating the need for additional manipulation.

train_data = TabularDataset(PATH_TO_TRAIN_DATA)
test_data = TabularDataset(PATH_TO_TEST_DATA) 

Step 2: Fitting Models

The fit() function handles the heavy lifting, studying the dataset, preparing it for training, and fitting multiple models. It combines these models to produce a high-accuracy predictor.

predictor = TabularPredictor(label=LABEL_COLUMN_NAME).fit(train_data)

Step 3: Making Predictions

Finally, the predict() function generates predictions from new data, utilizing the models trained during the fit() phase.

prediction = predictor.predict(test_data)
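Put together, the three steps fit in one small helper. This is a minimal sketch, assuming a recent AutoGluon release; the function name train_and_predict, its arguments, and the 60-second default budget are placeholders chosen for illustration, not part of the library.

```python
def train_and_predict(train_path, test_path, label, time_limit=60):
    """Train within `time_limit` seconds and return predictions for the test file."""
    # Imported inside the function so this helper can be defined
    # even before AutoGluon is installed.
    from autogluon.tabular import TabularDataset, TabularPredictor

    # Step 1: load the data (each object behaves like a pandas DataFrame).
    train_data = TabularDataset(train_path)
    test_data = TabularDataset(test_path)

    # Step 2: preprocess, train, and ensemble models within the time budget.
    predictor = TabularPredictor(label=label).fit(train_data, time_limit=time_limit)

    # Step 3: predict. The label column is dropped if the test file contains it.
    features = test_data.drop(columns=[label], errors="ignore")
    return predictor.predict(features)
```

Calling train_and_predict("train.csv", "test.csv", label="target") with your own file names and label column should return one prediction per test row.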

The Magic of the fit() Function

The fit() function encompasses two crucial steps: data preprocessing and model fitting.

Data Preprocessing

  • AutoGluon-Tabular categorizes features into numeric, categorical, text, or date/time.
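  • Columns that fit none of these types and whose values do not repeat (e.g., User IDs) are discarded, since they carry little predictive value.
  • Text columns are transformed into numeric vectors using n-gram features.
  • Missing values in discrete variables are assigned to an additional “Unknown” category rather than imputed, which also lets the model handle unseen categories during predictions.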
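The preprocessing rules above can be illustrated in plain Python. This is a simplified sketch, not AutoGluon's implementation: the helper names (infer_type, fit_categories, encode, ngram_counts) are hypothetical, and text and date/time detection is left out to keep it short.

```python
from collections import Counter


def infer_type(values):
    """Classify a column as 'numeric', 'categorical', or 'discard'."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        return "numeric"
    # A non-numeric column in which no value repeats (e.g. a user ID)
    # is assumed to carry little signal and is dropped.
    if len(set(present)) == len(present):
        return "discard"
    return "categorical"


def fit_categories(values):
    """Map each category seen in training to an integer code; 0 is 'Unknown'."""
    codes = {"Unknown": 0}
    for v in values:
        if v is not None and v not in codes:
            codes[v] = len(codes)
    return codes


def encode(value, codes):
    """Missing and previously unseen values both fall back to 'Unknown'."""
    return codes.get(value, codes["Unknown"])


def ngram_counts(text, n=2):
    """Count word n-grams, a simple numeric representation of a text field."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


table = {
    "user_id": ["u1", "u2", "u3", "u4"],
    "age": [31, None, 45, 27],
    "plan": ["free", "pro", None, "free"],
}
types = {name: infer_type(col) for name, col in table.items()}
plan_codes = fit_categories(table["plan"])
```

Here user_id is discarded, age is numeric, and plan is categorical; a plan value never seen in training, such as "enterprise", is encoded exactly like a missing one.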

Model Fitting

  • AutoGluon-Tabular trains a series of models in a specific sequence, starting with reliable models like random forests and progressing to more computationally expensive ones like k-nearest neighbors.
  • The approach allows imposing a time limit, returning the best models within that constraint.
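The idea of training models in a fixed order under a time limit can be sketched as a simple loop. This is an illustration only: fit_within_budget, the candidate tuples, and the scores are made up for the example, and the training functions are stand-ins that return a validation score immediately.

```python
import time


def fit_within_budget(candidates, time_limit):
    """Train candidates in order and return the best one found in time.

    candidates: list of (name, estimated_seconds, train_fn) tuples, where
    train_fn() returns a validation score (higher is better).
    """
    start = time.monotonic()
    results = {}
    skipped = []
    for name, estimated_seconds, train_fn in candidates:
        remaining = time_limit - (time.monotonic() - start)
        # Skip any model that is not expected to finish in the time left.
        if estimated_seconds > remaining:
            skipped.append(name)
            continue
        results[name] = train_fn()
    best = max(results, key=results.get) if results else None
    return best, results, skipped


# Hypothetical candidates, ordered from cheap to expensive.
candidates = [
    ("random_forest", 1.0, lambda: 0.90),
    ("lightgbm", 2.0, lambda: 0.93),
    ("knn", 120.0, lambda: 0.80),
]
best, results, skipped = fit_within_budget(candidates, time_limit=60)
```

In AutoGluon itself the budget is passed as the time_limit argument of fit(), in seconds; the loop above only mimics the behaviour described in this section.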

Supported Algorithms:

  • Random Forests
  • Extremely Randomized Trees
  • k-nearest neighbors
  • LightGBM boosted trees
  • CatBoost boosted trees
  • AutoGluon-Tabular deep neural networks

Novelty in Neural Network Architecture

Contrary to the belief that deep learning struggles with tabular data, AutoGluon-Tabular introduces a novel neural network architecture. It incorporates an embedding layer for each categorical feature, enhancing predictive accuracy, especially when combined with other model types.
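To make the embedding idea concrete, here is a small plain-Python sketch of how a network input could be assembled: each categorical feature gets its own lookup table of vectors, and the looked-up vectors are concatenated with the numeric features. The Embedding class and build_input helper are hypothetical, the vectors are only randomly initialised, and a real network would learn them during training and pass the result through further layers.

```python
import random


class Embedding:
    """Lookup table with one vector per category; index 0 is 'Unknown'."""

    def __init__(self, categories, dim, rng):
        self.index = {"Unknown": 0}
        for c in categories:
            self.index.setdefault(c, len(self.index))
        # Random initial vectors; training would adjust these values.
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in self.index]

    def __call__(self, value):
        # Unseen or missing categories share the 'Unknown' vector.
        return self.table[self.index.get(value, 0)]


def build_input(row, numeric_cols, embeddings):
    """Concatenate numeric features with each categorical feature's embedding."""
    vec = [float(row[c]) for c in numeric_cols]
    for col, emb in embeddings.items():
        vec.extend(emb(row[col]))
    return vec


rng = random.Random(0)
embeddings = {
    "plan": Embedding(["free", "pro"], dim=3, rng=rng),
    "country": Embedding(["US", "DE", "IN"], dim=2, rng=rng),
}
row = {"age": 31, "income": 52000, "plan": "pro", "country": "DE"}
x = build_input(row, numeric_cols=["age", "income"], embeddings=embeddings)
```

The resulting vector has two numeric entries followed by three values for plan and two for country, so each categorical feature contributes a small dense representation instead of a single integer code.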

Ensembles and Multi-Layer Stacking

Ensembling, a classic technique of combining multiple models for enhanced accuracy, is automated in AutoGluon-Tabular. The library introduces a multi-layer stack ensemble, simplifying the process for users.

Ensemble Layers:

  1. Base Layer: Trains multiple base models sequentially.
  2. Concat Layer: Concatenates the output of the base layer with input features.
  3. Stacker Layer: Trains multiple stacker models using the concatenated output, reusing base models.
  4. Weighting Layer: Selects stacker models into a new ensemble to maximize validation accuracy.

AutoGluon-Tabular eliminates the need for manual stacking and ensembling, automating the process and improving predictive accuracy.
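The four layers can be traced in a toy, dependency-free sketch. Everything here is an assumption made for illustration: the models are a constant predictor and one-feature linear fits, the weighting layer is a greedy selection loop that stops when the validation error no longer improves, and a production system would feed the stackers out-of-fold predictions rather than predictions on the training rows.

```python
from statistics import mean


def fit_constant(X, y):
    """Base model: always predict the training mean."""
    m = mean(y)
    return lambda row: m


def fit_linear_on(col):
    """Return a trainer that fits y = a * row[col] + b by least squares."""
    def fit(X, y):
        xs = [row[col] for row in X]
        mx, my = mean(xs), mean(y)
        var = sum((x - mx) ** 2 for x in xs)
        a = sum((x - mx) * (t - my) for x, t in zip(xs, y)) / var if var > 1e-12 else 0.0
        return lambda row: a * row[col] + (my - a * mx)
    return fit


def mse(predict, X, y):
    return mean((predict(r) - t) ** 2 for r, t in zip(X, y))


def fit_stack(X_train, y_train, X_val, y_val, max_rounds=10):
    # 1. Base layer: train each base model in turn.
    trainers = [fit_constant] + [fit_linear_on(c) for c in range(len(X_train[0]))]
    base = [fit(X_train, y_train) for fit in trainers]

    # 2. Concat layer: original features followed by base predictions.
    def concat(row):
        return list(row) + [m(row) for m in base]

    # 3. Stacker layer: reuse the same model type on the wider input.
    Z_train = [concat(r) for r in X_train]
    stackers = [fit_linear_on(c)(Z_train, y_train) for c in range(len(Z_train[0]))]

    # 4. Weighting layer: greedily add stackers (repeats allowed) while the
    #    averaged ensemble keeps improving on the validation set.
    Z_val = [concat(r) for r in X_val]

    def ensemble(indices):
        return lambda z: mean(stackers[i](z) for i in indices)

    chosen, best_err = [], float("inf")
    for _ in range(max_rounds):
        err, i = min((mse(ensemble(chosen + [i]), Z_val, y_val), i)
                     for i in range(len(stackers)))
        if err >= best_err:
            break
        chosen.append(i)
        best_err = err

    final = ensemble(chosen)
    return (lambda row: final(concat(row))), chosen, best_err


X_train = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y_train = [2 * a + 3 * b for a, b in X_train]
X_val = [(1.5, 2.5), (3.5, 3.0), (5.5, 5.0)]
y_val = [2 * a + 3 * b for a, b in X_val]
predict, chosen, val_err = fit_stack(X_train, y_train, X_val, y_val)
```

Because the stackers see the raw features as well as the base predictions, the selected ensemble can never score worse on the validation set than the best single base model in this sketch.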

Fault Tolerance with AutoGluon-Tabular

In real-world scenarios, disruptions may occur during training. AutoGluon-Tabular incorporates fault tolerance measures to mitigate potential setbacks.

  • Estimation of Training Time: The system estimates required training time, skipping layers exceeding the time budget.
  • Model Checkpoints: Models are saved immediately after training, ensuring progress is not lost.
  • Partial Training Predictions: Even in case of failure, AutoGluon-Tabular can generate predictions if at least one model on one fold is trained.
  • Algorithm-Specific Checkpoints: For algorithms supporting intermediate checkpointing, such as tree-based algorithms and neural networks, predictions can be generated using checkpoints.
  • Anticipation of Failures: AutoGluon-Tabular can anticipate potential training failures and skip to the next model.
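Two of these measures, saving each model as soon as it is trained and skipping models that fail, can be sketched with the standard library alone. The trainers, the file layout, and the function names below are hypothetical and far simpler than what a real AutoML system stores; the "models" are just constants so the example stays short.

```python
import pickle
import tempfile
from pathlib import Path
from statistics import mean, median


def train_mean(y):
    return {"kind": "constant", "value": mean(y)}


def train_median(y):
    return {"kind": "constant", "value": median(y)}


def train_broken(y):
    raise MemoryError("simulated out-of-memory failure")


def fit_with_checkpoints(trainers, y, ckpt_dir):
    """Train each model, saving it immediately; skip failures and finished work."""
    ckpt_dir = Path(ckpt_dir)
    for name, train in trainers.items():
        path = ckpt_dir / f"{name}.pkl"
        if path.exists():
            continue  # already trained in an earlier, interrupted run
        try:
            model = train(y)
        except Exception as exc:
            print(f"skipping {name}: {exc}")
            continue
        path.write_bytes(pickle.dumps(model))  # checkpoint right after training


def predict_from_checkpoints(ckpt_dir):
    """Average the predictions of whatever models were saved successfully."""
    paths = sorted(Path(ckpt_dir).glob("*.pkl"))
    if not paths:
        raise RuntimeError("no trained model available")
    models = [pickle.loads(p.read_bytes()) for p in paths]
    return mean(m["value"] for m in models)


trainers = {"mean": train_mean, "broken": train_broken, "median": train_median}
with tempfile.TemporaryDirectory() as ckpt_dir:
    fit_with_checkpoints(trainers, [1.0, 2.0, 3.0, 10.0], ckpt_dir)
    saved = sorted(p.stem for p in Path(ckpt_dir).glob("*.pkl"))
    prediction = predict_from_checkpoints(ckpt_dir)
```

The failing trainer is skipped, the other two models are saved, and a prediction is still produced from the surviving checkpoints. Running fit_with_checkpoints again on the same directory would only retry the model that has no checkpoint yet.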

AutoGluon-Tabular stands as a powerful tool in the realm of AutoML, particularly for tabular data. Its user-friendly API, combined with robustness, predictability, and fault tolerance, makes it an invaluable asset for both novice and experienced data scientists. The library’s innovative approaches to neural network architecture and multi-layer stacking further enhance its capabilities, automating complex processes and delivering high-quality models with ease. Whether you are new to data science or a seasoned practitioner, AutoGluon-Tabular opens the door to efficient and effective automated machine learning for tabular data.

  • Parts of this article were written using Generative AI
