avatarBenjamin Marie

Summary

SPIN is a method for self-training LLMs using synthetic data generated by the LLM itself, without additional data or external LLMs, and it surpasses the performance achieved after DPO with additional preference data.

Abstract

The article discusses SPIN, a method for self-training LLMs using synthetic data generated by the LLM itself, without additional data or external LLMs. SPIN uses two "player" models, the main player and the opponent player, which are iteratively updated given their respective feedback. The main player is trained to distinguish LLM responses from human responses, while the opponent player seeks to improve the LLM, making its responses indistinguishable from human data for the main player. SPIN surpasses the performance achieved after DPO with additional preference data and is more effective than training for more epochs. The article also mentions that SPIN and its players have a lot of similarities with GANs, which are well-known to be difficult to train.

Opinions

  • SPIN is an effective method for self-training LLMs using synthetic data generated by the LLM itself.
  • SPIN surpasses the performance achieved after DPO with additional preference data.
  • SPIN is more effective than training for more epochs.
  • SPIN and its players have a lot of similarities with GANs, which are well-known to be difficult to train.
  • The article does not provide clear information on how difficult it is to find the right hyperparameters for SPIN.
  • The article mentions that SPIN requires only the initial model and the existing supervised fine-tuning dataset for fine-tuning.
  • The article mentions that the initial model used by the authors is Zephyr trained without DPO.

SPIN: Self-play Fine-tuning to Improve LLMs without Additional Data

LLM self-training

Synthetic data generated by LLMs are successfully used to train smaller LLMs. Phi-2 and Zephyr are two very good examples of popular LLMs trained on synthetic data. But these data are additional data, i.e., we need another, better, LLM to generate them.

Can the LLM improve its fine-tuning using synthetic data that it has generated by itself, i.e., without using any additional data or external LLMs?

To answer this question, Chen et al. propose SPIN, a method for self-training LLM:

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

SPIN uses two “player” models. The main player is trained to distinguish LLM responses from human responses by minimizing a specific value. This value reflects the degree of belief that a given response, given a prompt, originates from a human rather than the LLM. The opponent player model seeks to improve the LLM, making its responses indistinguishable from human data for the main player.

Both the main and opponent players are iteratively updated given their respective feedback.

For fine-tuning, SPIN requires only the initial model and the existing supervised fine-tuning dataset, enabling LLM self-improvement. The initial model used by the authors is Zephyr trained without DPO:

SPIN surpasses the performance achieved after DPO with additional preference data.

source

Iterative training is more effective than training for more epochs, with SPIN maintaining performance even with extended training durations.

source

The fact that SPIN outperforms DPO without additional data is quite impressive. The paper is, however, not very clear on how difficult it is to find the right hyperparameters. SPIN and its players have a lot of similarities with GANs (Generative Adversarial Networks) which are well-known to be difficult to train.

This article was originally published in the The Weekly Kaitchup:

Machine Learning
Artificial Intelligence
Data Science
Technology
Programming
Recommended from ReadMedium