
I Fine-Tuned GPT-2 on 100K Scientific Papers. Here’s The Result

Content writing by AI is common, but is it possible for an AI to write technical essays?

Scientific text generation with AI. Image by the author.

Artificial agents are widely used nowadays and are able to achieve superhuman performance in multiple tasks. Text generation is one of the emerging applications of AI and is used in several scenarios. Freeform text generation, Q&A, and abstractive summarization are only some of them.

To investigate whether an AI could write technical essays, I trained a causal language model on about 100K machine learning papers.

What is the quality of the result? What are the limitations of the proposed approach? Is it possible to get GPT-2 to write a full paper? These are the questions that I will try to answer.

Introduction

The Generative Pre-trained Transformer 2 (GPT-2) is a language model released by OpenAI in 2019. It can be used for several purposes: text summarization, translation, question answering, and text generation. GPT-2 is pre-trained on a large English corpus and can be fine-tuned for a specific task.

In this article, I will use the Huggingface Distilled-GPT2 (DistilGPT2) model. DistilGPT2 has 82 million parameters and was developed using knowledge distillation; it is lighter and faster than GPT-2.

1. Importing Tools

I started by importing all the required tools and libraries.

2. Importing the baseline model and tokenizer

Then, I used TFAutoModelForCausalLM and AutoTokenizer to automatically load the correct model and tokenizer for a specific checkpoint. A checkpoint contains the weights of a pre-trained model.

In this case, I imported the DistilGPT2 checkpoint. I also set the end-of-sequence token as the padding token, since GPT-2 does not define a padding token by default.

3. Importing Data

The dataset for the fine-tuning operation is available on the Huggingface Hub, and it’s a subset of a bigger dataset hosted on Kaggle.

The original dataset, published by Cornell University, contains titles and abstracts of 1.7M+ scientific papers belonging to the STEM category. The subset hosted on the Huggingface Hub contains information on around 100K papers pertaining to the machine learning category.

I decided to fine-tune DistilGPT-2 on abstracts only. I started by loading the dataset from the Huggingface Hub.

The dataset consists of 117,592 rows and 4 columns (two of which are not needed here).

After this step, I decided to visualize the length distribution of the abstracts (in terms of words) with a histogram.

Abstracts length distribution. Image by the author.

Most of the abstracts are between about 100 and 250 words in length, and only a few are over 300 words. In particular: mode=150, mean=167, and median=164.
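As a minimal sketch of how such summary statistics can be computed, the snippet below counts whitespace-separated words in a few made-up abstracts. The `abstracts` list is hypothetical toy data, not the real dataset, and whitespace splitting is only a rough proxy for word counting.

```python
from statistics import mean, median, mode

# Hypothetical toy abstracts; the real dataset has about 117K rows.
abstracts = [
    "we propose a new method",
    "this paper studies clustering of data points",
    "results show better accuracy",
    "we present a simple baseline",
]

# Word-level length of each abstract (whitespace split as a rough proxy).
lengths = [len(text.split()) for text in abstracts]

summary = {
    "mode": mode(lengths),
    "mean": mean(lengths),
    "median": median(lengths),
}
print(lengths, summary)
```

The same `lengths` list is what a histogram would be drawn from.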

In addition to giving information about the dataset, the histogram allowed me to determine the maximum length of the inputs to be fed to the model.

I decided to set the maximum input length to 300 tokens: abstracts longer than this will be truncated. This is because all inputs must be padded to the same length, and long sequences of text greatly increase the training time.

4. Split into Train and Validation Set

Next, I split the dataset into train and validation sets with train_test_split(). It is also possible to specify the partition sizes with the test_size parameter.

train_test_split() returns a dictionary of Datasets, formally a DatasetDict. While it is possible to work with a DatasetDict, I prefer to use two separate Datasets: train and val.
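To make the idea concrete, here is a library-free sketch of a shuffled split. The helper `split_rows`, the seed, and the 20% validation share are assumptions for illustration; this is not the datasets library implementation. The returned keys mirror the "train" and "test" keys of a DatasetDict.

```python
import random

def split_rows(rows, test_size=0.1, seed=42):
    """Shuffle row indices and split them into train and test partitions."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_test = round(len(rows) * test_size)     # size of the held-out part
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return {
        "train": [rows[i] for i in train_idx],
        "test": [rows[i] for i in test_idx],
    }

rows = [f"abstract {i}" for i in range(20)]   # hypothetical toy rows
splits = split_rows(rows, test_size=0.2)

# Unpack into two separate collections, as described above.
train, val = splits["train"], splits["test"]
print(len(train), len(val))
```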

5. Tokenize Data with HF Tokenizer

To tokenize the data I defined a generic tokenization function, and then I applied this function to all the samples by using map(). Inside the tokenization function, I used the tokenizer imported in the beginning.

The tokenizer has some important parameters to set:

column to tokenize. In this case, “abstract”.
padding. In this case, “max_length”, which pads each sequence to the length specified by the max_length parameter.
truncation. If True, truncates sequences longer than the maximum length specified by the max_length parameter.
max_length. Specifies the maximum length of a sequence.

Please note that when batched=True is set, the map() method sends batches of 1000 samples by default.
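The effect of padding and truncation can be sketched without any library. The helpers `pad_or_truncate` and `map_batched`, the padding id 0, and right-side padding are all assumptions made for this illustration; a real tokenizer does this work internally.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right-padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding
    return input_ids, attention_mask

def map_batched(samples, fn, batch_size=1000):
    """Apply fn to consecutive batches, the way a batched map would."""
    out = []
    for start in range(0, len(samples), batch_size):
        out.extend(fn(samples[start:start + batch_size]))
    return out

PAD_ID = 0  # hypothetical padding id

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=PAD_ID)

# Encode three toy samples in batches of two.
samples = [[5, 6, 7], [1, 2, 3, 4, 5, 6, 7], [9]]
encoded = map_batched(
    samples,
    lambda batch: [pad_or_truncate(s, 5, PAD_ID) for s in batch],
    batch_size=2,
)
print(short_ids, short_mask, long_ids)
```

Every encoded sample ends up with the same length, which is what allows them to be stacked into fixed-shape batches.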

6. Adding Labels to Train and Validation Sets

In causal language modeling, the labels are the input tokens (input_ids) shifted by one position. The Huggingface model performs this shift internally, so I only needed to create a labels column in the datasets with a copy of the tokens (input_ids).

After this operation, the train and validation sets had three columns: input_ids and attention_mask from the tokenization process, and labels from the create_labels() process.
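A small sketch of what this means in practice. The name `create_labels` comes from the text above, but its body here is an assumption; `shifted_pairs` is a hypothetical helper that only visualizes the shift the model applies internally.

```python
def create_labels(example):
    """Add a 'labels' field that is a plain copy of 'input_ids'."""
    example = dict(example)                       # avoid mutating the input
    example["labels"] = list(example["input_ids"])
    return example

def shifted_pairs(example):
    """Pair each input token with the next token it has to predict."""
    ids, labels = example["input_ids"], example["labels"]
    return list(zip(ids[:-1], labels[1:]))

# Hypothetical token ids for one short sample.
example = create_labels({"input_ids": [10, 11, 12, 13],
                         "attention_mask": [1, 1, 1, 1]})
pairs = shifted_pairs(example)
print(sorted(example), pairs)
```

Each pair reads as "given this token (and everything before it), predict that one", which is the training objective of a causal language model.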

7. Converting Train and Validation Sets to TF Datasets

Next, I converted the datasets to tf.data.Dataset, which Keras can understand natively; for this purpose I used Model.prepare_tf_dataset().

Compared with the Dataset.to_tf_dataset() method, Model.prepare_tf_dataset() can automatically determine which column names to use as input and provides a default data collator.

Note that I only shuffled the train data. After some experiments, I found that a batch size of 16 worked best.

8. Compiling, Fitting, and Evaluating the Model

Before fitting the model, I set up a learning rate scheduler and an optimizer. I used the ExponentialDecay scheduler from Keras and the AdamWeightDecay optimizer from Huggingface.

Learning rate decay is a technique to reduce the learning rate over time. With exponential decay, the learning rate is reduced exponentially.
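The schedule follows the formula lr(step) = initial_lr * decay_rate ^ (step / decay_steps). Below is a plain-Python sketch of that formula; the hyperparameter values are hypothetical and are not the ones used in the article.

```python
def exponential_decay(step, initial_lr, decay_steps, decay_rate, staircase=False):
    """Learning rate at a given training step under exponential decay."""
    exponent = step / decay_steps
    if staircase:
        # Decay in discrete jumps instead of continuously.
        exponent = step // decay_steps
    return initial_lr * decay_rate ** exponent

# Hypothetical hyperparameters, chosen only for illustration.
INITIAL_LR, DECAY_STEPS, DECAY_RATE = 5e-4, 500, 0.95

schedule = [
    exponential_decay(s, INITIAL_LR, DECAY_STEPS, DECAY_RATE)
    for s in (0, 500, 1000, 5000)
]
print(schedule)
```

After every `decay_steps` steps the learning rate is multiplied by `decay_rate`, so it shrinks smoothly toward zero without ever reaching it.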

Next, I compiled the model. Transformers models generally compute loss internally and there is no need to specify a loss parameter. For language modeling the selected loss is cross-entropy.

At this point, I set up a callback to the Huggingface Hub to save the fine-tuned model.

I also set up a callback to Tensorboard.

Finally, I fitted the model by calling the fit() method. I specified the train and validation sets and the number of epochs.

After the training step, I evaluated the model and got its cross-entropy loss on the validation set.

Loss=2.2371. Generally, the quality of a language model is measured in ‘perplexity’. To convert cross-entropy to perplexity, I simply raised e to the power of the cross-entropy loss.

In this case perplexity=9.37.
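The conversion can be checked in a couple of lines. This sketch assumes the loss is the mean cross-entropy per token measured in nats (natural logarithm), which is why the base is e.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is e raised to the mean cross-entropy (in nats)."""
    return math.exp(cross_entropy_loss)

val_loss = 2.2371              # validation loss reported above
ppl = perplexity(val_loss)
print(f"{ppl:.2f}")            # prints 9.37
```

Intuitively, a perplexity of about 9.4 means the model is, on average, as uncertain as if it were choosing uniformly among roughly nine tokens at each step.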

9. Generating Text Using a Pipeline

At this point, I leveraged the pipeline functionality provided by Huggingface to see the model in action.

I set up a text-generation pipeline and specified the fine-tuned model, the tokenizer, and the framework to use. max_new_tokens specifies the maximum number of tokens (roughly, words or word pieces) to generate in addition to the initial prompt.

Two lines of code are enough to generate text with a pipeline:

The pipeline is not the only way to use a model: it is possible to manually tokenize the prompt, generate new tokens, and decode the tokens to natural language. Here’s an example:
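The three steps can be illustrated with a toy, library-free stand-in. Everything here is hypothetical: the eight-word vocabulary, the whitespace "tokenizer", and the lookup table that plays the role of the model. A real model scores every token in its vocabulary at each step; this sketch simply looks up one fixed successor.

```python
# Hypothetical toy vocabulary; id 0 is the end-of-sequence token.
vocab = ["<eos>", "the", "role", "of", "recommender", "systems", "is", "growing"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# Stand-in for a model: maps the last token id to the next token id.
next_token = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 0}

def encode(text):
    """Step 1: turn the prompt into token ids."""
    return [token_to_id[word] for word in text.split()]

def generate(input_ids, max_new_tokens):
    """Step 2: append up to max_new_tokens ids, stopping at end-of-sequence."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        new_id = next_token[ids[-1]]
        ids.append(new_id)
        if new_id == 0:
            break
    return ids

def decode(ids):
    """Step 3: turn token ids back into text, skipping end-of-sequence."""
    return " ".join(vocab[i] for i in ids if i != 0)

prompt_ids = encode("the role of")
output_ids = generate(prompt_ids, max_new_tokens=3)
text = decode(output_ids)
print(text)  # prints: the role of recommender systems is
```

Note that the limit counts only newly generated tokens: the three prompt tokens are kept and three more are appended.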

10. Results Analysis

After fine-tuning the model, I wanted to understand what the model has learned and how the generated text is influenced by the fact that paper abstracts were used for training.

First, I generated a sample text by using “the role of recommender systems” as a prompt. This is the output generated by the model:

'the role of recommender systems in the real-world is still largely to be demonstrated by the lack of data and the need for data. Hence, for many recommendation systems such as Amazon or Spotify, it is necessary to provide a user knowledge of the content that has been clicked during the recommendation and provide a user knowledge of the user preferences. The previous works attempt to exploit data related to items they have clicked during an appropriate time frame. But little attention has been paid to the problem of item classification where a suitable time-frame is available for user prediction. In this paper, we propose a multi-task learning approach to address the problem of item classification. For each task, we apply the contextual cues introduced by the user, and then learn to predict the user's purchased items' interests.   Since the contexts of user preferences, we consider the feature that the user's preference (the time-frame) is present at the time of recommendation. In particular, we propose an alternative method for attribute-aware learning that utilizes the contextual cues in the sequence and the user's preferences to learn a classifier that classifies the user according to the contextual cues. This is done by maximizing the mutual information between the user's rating and the content-aware prediction task. The experimental results show that our model achieves better accuracy than the existing state-of-the-art methods, achieving up to 33.6% more accuracy on real-world recommendation tasks compared to the state-of-the-art methods. Our source code is available at http://github.com/J-medylerFashion/jmedian.github.'

This result sounded somewhat like it had been copied and pasted from one of the existing abstracts, but after checking it with some anti-plagiarism tools, I found that it is 100% unique.

During learning, the model captured common features of the abstracts and learned how to replicate them while still generating fresh text. Interestingly, the model used scientific language and common expressions: The previous works…, In this paper…, We propose…, The experimental result….

The model also learned that sometimes a repository is added to the abstract: in this example, the generated text contains a URL to a GitHub repository. The URL and the repository don’t exist, so they were generated by the model (and not copied).

As a second experiment, I generated a sample text by using “clustering” as a prompt. In this case, the prompt consisted of only one word, so the text generation is not driven by additional context. This is the output:

'clustering can be used to extract clusters from data points. However, in many real-world scenarios, data points often appear in non-Euclidean relaxations, which allows different clusters to be discovered simultaneously without the need for specific optimization. In this paper, we consider a class of applications where clustering methods are applied. A common example is that in image selection problems, we show to the optimizer that the expected improvement will be obtained by minimizing the average performance of the clustering method. Our main contribution is a characterization of optimization problems with respect to clustering methods. Specifically, we present a generalization of the optimal transport method for which our special bounds in terms of the maximum number of clusters are derived. We apply this algorithm to classification of human brain tumors, showing that it is computationally efficient. We show that based on our analysis of a particular type of tumor, our analysis leads to faster convergence to the optimal solution.'

In this case too, the text is reported to be unique and contains common expressions: In this paper…, Our main contribution…, We present….

For the third experiment, I used “transformers for text generation”. Here’s the result:

'transformers for text generation are becoming increasingly popular in computer vision, and so are the prevalent datasets that contain high-dimensional representations without manual feature engineering. We propose two algorithms for image generation from convolutional networks. We give the first procedure to remove important parts of this architecture and propose a novel architecture dubbed Multi-scale Text Generation Network (MTVGNet). Our MTVGNet-like architecture produces a compact set of representations without changing the model architecture or the pre-trained convolutional network. The resulting representation is compact and can be used as both training examples in training and inference in inference. Extensive numerical experiments on image synthesis and computer vision demonstrate that MTVGNet-like improves generative model performance by up to 32% over state-of-the-art methods for unconditional image generation, while incurring only 20% higher frame quality.'

Even in the third example, there are multiple elements common to scientific abstracts, and the overall quality is slightly better than the previous one.

The cited Multi-scale Text Generation Network (MTVGNet) seems to be an “invention” of the model since I cannot find references in the literature.

I’d like to conclude this section with some word clouds. The first one represents the most frequent words in all the abstracts in the dataset. The others depict the most common words in 10 text samples generated from different prompts.

It is possible to immediately notice the similarity of words between the dataset abstracts and the generated samples.

Most frequent words in all the abstracts. Image by the author.

Most frequent words in the 10 AI-generated text samples. Prompt: the role of recommender systems. Image by the author.

Most frequent words in the 10 AI-generated text samples. Prompt: clustering. Image by the author.

Most frequent words in the 10 AI-generated text samples. Prompt: transformers for text generation. Image by the author.

Conclusions

In this article, I fine-tuned a transformer on scientific paper abstracts. What is the quality of the result? What are the limitations of this approach? Is it possible to get GPT-2 to write a full paper?

The model has learned how abstracts are generally written and tries to replicate the same style. The results are not bad considering the available data and only one training epoch.

The model seems to be able to generate technical text about different machine learning topics, but the result does not always make complete sense and sometimes there are mistakes.

Certainly, this approach has some limitations, one of which is the length of generated text. Although it is possible to overcome the problem by generating multiple blocks of text, at some point it would be hard to logically connect the different sections generated.

In conclusion, even if the model cannot write an entire technical article, I am pleasantly surprised, and convinced that some of the achievable results can serve as inspiration or as a starting point.

Thanks for reading!

Additional Resources

E. Bianchi, Fine-Tuning GPT-2 for Text Generation with Tensorflow (2022), Google Colaboratory
Hugging Face, Documentation (2022)
Hugging Face, Distilgpt2 (2022)
Cornell University, arXiv Dataset on Kaggle (2022)
CShorten, ML-ArXiv-Papers dataset (2022)
Wikipedia contributors, Perplexity (2022)
Wikipedia contributors, Cross Entropy (2022)
Wikipedia contributors, GPT-2 (2022)
Keras team, Keras Documentation: ExponentialDecay (2022)
