Fabio Chiusano



Two minutes NLP — Building blocks to train a paraphrases generation model effortlessly

T5, BART, and PEGASUS

Photo by Sid Balachandran on Unsplash

Transfer learning and pre-trained language models have pushed the limits of language understanding and generation in Natural Language Processing. To create a model that generates paraphrases, the common procedure is to pick a pre-trained language model and finetune it on a paraphrase dataset. The closer the pre-training objective is to paraphrase generation, the better the finetuned model tends to perform.

Pre-trained models

  • T5: pre-trained on a multi-task mixture of unsupervised and supervised tasks, each of which is converted into a text-to-text format.
  • BART: pre-trained by corrupting text with an arbitrary noising function and learning to reconstruct the original text. It is particularly effective when finetuned for text generation but also works well for comprehension tasks.
  • PEGASUS: during pre-training, important sentences are masked from an input document and generated together as one output sequence from the remaining sentences, similar to an extractive summary.
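
To make the PEGASUS objective more concrete, here is a toy sketch of how one pre-training example could be built. It is a simplification under several assumptions: the sentence splitter is naive, "importance" is approximated with a unigram-overlap F1 between a sentence and the rest of the document, and the `<mask_1>` placeholder and the function names are chosen for this illustration only.

```python
import re
from collections import Counter

MASK_TOKEN = "<mask_1>"  # placeholder used in this sketch for a masked sentence


def split_sentences(document):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def unigram_f1(candidate, reference):
    # Simplified importance score: F1 over overlapping lowercase word counts.
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def gap_sentence_example(document, num_masked=1):
    """Return (source, target) for one toy gap-sentence training example."""
    sentences = split_sentences(document)
    scores = []
    for i, sentence in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append(unigram_f1(sentence, rest))
    # Pick the highest-scoring sentences, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    masked = sorted(ranked[:num_masked])
    source = " ".join(
        MASK_TOKEN if i in masked else s for i, s in enumerate(sentences)
    )
    target = " ".join(sentences[i] for i in masked)
    return source, target
```

The model sees `source` as input and is trained to generate `target`, which is why the objective sits close to summarization and, by extension, to rewriting tasks such as paraphrasing.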

Paraphrase datasets available for finetuning

  • TaPaCo: a freely available paraphrase corpus for 73 languages, extracted from the Tatoeba database, a crowdsourcing project mainly geared towards language learners.
  • Quora Question Pairs: a dataset of question pairs, each labeled as duplicate or not.
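
Before finetuning a sequence-to-sequence model, a duplicate-labeled dataset has to be turned into input/output text pairs. The sketch below shows one way this could look for rows shaped like the Quora release. The column names `question1`, `question2`, and `is_duplicate` are assumed here, and the `paraphrase: ` task prefix is a hypothetical convention in the spirit of T5's text-to-text format, not something the datasets prescribe.

```python
TASK_PREFIX = "paraphrase: "  # hypothetical T5-style task prefix


def to_text_to_text(rows, both_directions=True):
    """Convert duplicate-labeled question pairs into source/target examples."""
    examples = []
    for row in rows:
        # Only pairs marked as duplicates are usable as paraphrases.
        if int(row["is_duplicate"]) != 1:
            continue
        q1 = row["question1"].strip()
        q2 = row["question2"].strip()
        # Skip empty or trivially identical pairs.
        if not q1 or not q2 or q1.lower() == q2.lower():
            continue
        examples.append({"source": TASK_PREFIX + q1, "target": q2})
        if both_directions:
            # A paraphrase relation is symmetric, so the reversed pair
            # can serve as a second training example.
            examples.append({"source": TASK_PREFIX + q2, "target": q1})
    return examples
```

Using both directions doubles the training data at no labeling cost. Whether that helps in practice depends on the dataset and should be checked on a held-out split.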

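Already finetuned models

  • PEGASUS finetuned: https://huggingface.co/tuner007/pegasus_paraphrase
  • T5 finetuned: https://huggingface.co/ramsrigouthamg/t5_sentence_paraphraser
  • BART finetuned: https://huggingface.co/eugenesiow/bart-paraphrase

Code examples

  • T5 finetuning: https://towardsdatascience.com/paraphrase-any-question-with-t5-text-to-text-transfer-transformer-pretrained-model-and-cbb9e35f1555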
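
As a rough sketch of how an already finetuned checkpoint could be used, the snippet below loads the `tuner007/pegasus_paraphrase` model from the Hugging Face Hub and generates a few candidates with beam search. It assumes the `transformers`, `torch`, and `sentencepiece` packages are installed. The generation settings (`max_length=60`, 10 beams) and the `dedupe_paraphrases` helper are illustrative choices, not recommendations from the model authors.

```python
def dedupe_paraphrases(source, candidates):
    """Drop candidates that echo the source or repeat an earlier candidate."""
    seen = {source.strip().lower()}
    kept = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(candidate.strip())
    return kept


def paraphrase(text, model_name="tuner007/pegasus_paraphrase", num_candidates=5):
    # Imported lazily so the helper above works without these packages.
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)
    batch = tokenizer(
        [text],
        truncation=True,
        padding="longest",
        max_length=60,
        return_tensors="pt",
    )
    # Beam search requires num_beams >= num_return_sequences.
    outputs = model.generate(
        **batch,
        max_length=60,
        num_beams=max(10, num_candidates),
        num_return_sequences=num_candidates,
    )
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return dedupe_paraphrases(text, decoded)
```

Calling `paraphrase("How can I improve my English?")` downloads the checkpoint on first use and returns a list of distinct candidate rewrites. Filtering out echoes of the input is a small post-processing step that is often useful, since generation models sometimes return the source sentence unchanged.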

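Two minutes NLP related posts

  • Two minutes NLP - Effective intents identification in short texts with unsupervised learning (LDA, USE, Sentence-BERT, PCA, UMAP, and HDBSCAN): https://readmedium.com/two-minutes-nlp-effective-intents-identification-in-short-texts-with-unsupervised-learning-61b7b670d3
  • Two minutes NLP - Quick tips to make your semantic search projects painless (semantic search, embeddings, symmetric vs asymmetric search, and embeddings storage): https://readmedium.com/two-minutes-nlp-quick-tips-to-make-your-semantic-search-projects-painless-2563cede8f23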


Stay up to date with the latest stories about applied Natural Language Processing and join the NLPlanet community on LinkedIn, Twitter, Facebook, and Telegram.
