Two minutes NLP: Building blocks to train a paraphrase generation model effortlessly
T5, BART, and PEGASUS
Transfer learning and pre-trained language models have pushed forward the limits of language understanding and generation in Natural Language Processing. To create a model that generates paraphrases, the common procedure is to pick a pre-trained language model and fine-tune it on a paraphrase dataset. The closer the pre-training objective is to paraphrase generation, the better the fine-tuned model tends to perform.
Pre-trained models
- T5: pre-trained on a multi-task mixture of unsupervised and supervised tasks, each of which is converted into a text-to-text format.
- BART: pre-trained by corrupting text with an arbitrary noising function and learning to reconstruct the original text. It is particularly effective when fine-tuned for text generation but also works well for comprehension tasks.
- PEGASUS: in pre-training, important sentences are masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
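To make the three pre-training setups more concrete, here is a toy sketch in plain Python that builds (input, target) pairs in the spirit of each one. It is not the actual pre-training code of any of these models: the function names are hypothetical, the BART-style noising is reduced to simple token masking, and the PEGASUS-style "importance" of a sentence is approximated by word overlap with the rest of the document, which is an assumption made only for illustration.

```python
import random

MASK = "<mask>"  # placeholder mask token, chosen for this sketch


def t5_text_to_text(task_prefix, text):
    """T5-style formatting: every task becomes 'prefix: input text'."""
    return f"{task_prefix}: {text}"


def bart_style_token_masking(text, mask_prob=0.3, seed=0):
    """Toy BART-style noising: randomly replace tokens with a mask token.

    Returns (corrupted_input, target), where the target is the original text
    that the model would learn to reconstruct.
    """
    rng = random.Random(seed)
    tokens = text.split()
    corrupted = [MASK if rng.random() < mask_prob else tok for tok in tokens]
    return " ".join(corrupted), text


def pegasus_style_gap_sentences(sentences, num_masked=1):
    """Toy PEGASUS-style objective: mask 'important' sentences and
    generate them together as one output sequence.

    Importance is approximated here by word overlap with the other
    sentences (an assumption of this sketch).
    """
    def overlap(i):
        words = set(sentences[i].lower().split())
        rest = set()
        for j, other in enumerate(sentences):
            if j != i:
                rest.update(other.lower().split())
        return len(words & rest)

    ranked = sorted(range(len(sentences)), key=overlap, reverse=True)
    masked = sorted(ranked[:num_masked])
    source = " ".join(MASK if i in masked else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in masked)
    return source, target
```

In all three cases the model sees a text input and has to produce a text output, which is why these models adapt naturally to paraphrase generation, where both sides are sentences.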
Paraphrase datasets available for fine-tuning
- TaPaCo: a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners.
- Quora Question Pairs: a dataset of question pairs, each labeled to indicate whether the two questions are duplicates of each other.
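As a rough illustration of how such a dataset can be turned into fine-tuning examples, the sketch below keeps only the pairs labeled as duplicates and formats them as text-to-text (source, target) examples. It assumes rows with the columns `question1`, `question2`, and `is_duplicate`, as in the publicly distributed Quora Question Pairs file. The helper name `build_paraphrase_pairs` and the `"paraphrase: "` prefix are choices made for this sketch, not something prescribed by T5 or by the dataset.

```python
def build_paraphrase_pairs(rows, prefix="paraphrase: ", both_directions=True):
    """Turn duplicate-labeled question pairs into (source, target) examples."""
    examples = []
    for row in rows:
        # Keep only pairs marked as duplicates: these act as paraphrases.
        if int(row["is_duplicate"]) != 1:
            continue
        q1 = row["question1"].strip()
        q2 = row["question2"].strip()
        # Skip empty or identical questions, which carry no training signal.
        if not q1 or not q2 or q1 == q2:
            continue
        examples.append((prefix + q1, q2))
        if both_directions:
            # Paraphrasing is symmetric, so the reversed pair is also valid.
            examples.append((prefix + q2, q1))
    return examples


# Tiny hand-written rows for illustration (not taken from the real dataset).
rows = [
    {"question1": "How do I learn Python?",
     "question2": "What is the best way to learn Python?",
     "is_duplicate": "1"},
    {"question1": "How do I learn Python?",
     "question2": "How do I cook rice?",
     "is_duplicate": "0"},
]
pairs = build_paraphrase_pairs(rows)
```

With a real CSV export, the rows could come from `csv.DictReader`. The resulting pairs would then be tokenized and fed to whichever sequence-to-sequence model is being fine-tuned.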
Already fine-tuned models
Code examples