This article discusses fine-tuning the RoBERTa Transformer model for predicting the reading ease of text excerpts using PyTorch.
Abstract
The article begins by explaining what Transformers are and their relevance in Natural Language Processing (NLP) tasks. It then introduces the RoBERTa Transformer model and its use in predicting the reading ease of text excerpts for grades 3-12 classroom use. The article provides a step-by-step guide on fine-tuning the RoBERTa model using PyTorch, including data splitting, creating a dataset class, defining the model class, and training the model. The author also shares examples of custom regressor heads that can be built for the model.
Opinions
Transformers, such as BERT, ALBERT, and RoBERTa, have gained popularity in NLP tasks due to their state-of-the-art performance.
Fine-tuning pre-trained Transformer models, such as RoBERTa, can be an effective approach for specific downstream tasks like predicting the reading ease of text excerpts.
The article emphasizes the importance of understanding the raw outputs from Transformer models to build custom regressor heads.
The author suggests using RMSE as the evaluation metric for this regression task.
The article provides a comprehensive guide to fine-tuning Transformer models, including data splitting, creating a dataset class, defining the model class, and training the model.
The author encourages readers to explore advanced training techniques for fine-tuning Transformer models.
The article provides a link to the author's GitHub repository for all codes and notebooks shared in the post.
Hands-on Tutorials
Transformers, can you rate the complexity of reading passages?
Fine-tuning RoBERTa with PyTorch to predict reading ease of text excerpts
Transformers, what are they actually? They are not the devices used in the transmission of electric power, nor the fictional living autonomous robots Optimus Prime or Bumblebee that can transform into other objects such as vehicles. Transformers in our context refer to BERT, ALBERT, RoBERTa, and the like, which are used in the world of data science to solve all kinds of Natural Language Processing (NLP) tasks, such as machine translation, text summarization, speech recognition, sentiment analysis, and many more. They are state-of-the-art language models for NLP, and they have gained tremendous popularity over the past few years.
This post will demonstrate fine-tuning a Transformer model, specifically RoBERTa, on our dataset of interest. The fine-tuning is done for a downstream task: predicting the reading ease of excerpts from literature for grades 3–12 classroom use.
This piece of work is motivated by an initiative from CommonLit, a nonprofit education technology organization. It sponsored a competition hosted on Kaggle (you can read more about it here), aiming to use machine learning to seek improvement over existing readability rating methods. This will aid literacy curriculum developers as well as educators in choosing appropriate reading passages for students. Presenting engaging passages with the right level of complexity and reading challenge will greatly benefit students in developing essential reading skills.
1. About the Data
The dataset that we are going to use can be found on this Kaggle page. This dataset contains around 2,800 records. The two important fields that we will be working with are excerpt and target.
Peeking into the data, excerpt is the text to predict reading ease of, and target is a numeric field that can contain positive or negative values. As seen in this dataset, it is a continuous variable with the minimum being -3.676268 and the maximum being 1.711390. Thus, given a specific excerpt, we need to predict the target value.
To give a little background, competition host Scott Crossley had mentioned in this discussion that “the target value is the result of a Bradley-Terry analysis of more than 111,000 pairwise comparisons between excerpts. Teachers spanning grades 3–12, a majority teaching between grades 6–10 served as the raters for these comparisons”.
Higher target values correspond to “easier to read”, and lower values correspond to “more difficult to read”. For example, let’s say we have three excerpts A, B, and C, and their corresponding target values are 1.599999, -1.333333, and -2.888888. This will mean A is easier to read than B, and B is easier to read than C.
To illustrate, below are two sample excerpts.
Excerpt with target value of 1.541672:
More people came to the bus stop just before 9am. Half an hour later they are all still waiting. Sam is worried. "Maybe the bus broke down," he thinks. "Maybe we won't go to town today. Maybe I won't get my new school uniform." At 9:45am some people give up and go home. Sam starts to cry. "We will wait a bit longer," says his mother. Suddenly, they hear a noise. The bus is coming! The bus arrives at the stop at 10 o'clock. "Get in! Get in!" calls the driver. "We are very late today!" People get on the bus and sit down. The bus leaves the stop at 10:10am. "What time is the return bus this afternoon?" asks Sam's mother. "The blue bus leaves town at 2:30pm," replies the driver. Sam thinks, "We will get to town at 11 o'clock." "How much time will we have in town before the return bus?" wonders Sam.
Excerpt with target value of -3.642892:
The iron cylinder weighs 23 kilogrammes; but, when the current has an intensity of 43 amperes and traverses 15 sections, the stress developed may reach 70 kilogrammes; that is to say, three times the weight of the hammer. So this latter obeys with absolute docility the motions of the operator's hands, as those who were present at the lecture were enabled to see. I will incidentally add that this power hammer was placed on a circuit derived from one that served likewise to supply three Hefner-Alteneck machines (Siemens D5 model) and a Gramme machine (Breguet model P.L.). Each of these machines was making 1,500 revolutions per minute and developing 25 kilogrammeters per second, measured by means of a Carpentier brake. All these apparatuses were operating with absolute independence, and had for generators the double excitation machine that figured at the Exhibition of Electricity. In an experiment made since then, I have succeeded in developing in each of these four machines 50 kilogrammeters per second, whatever was the number of those that were running; and I found it possible to add the hammer on a derived circuit without notably affecting the operation of the receivers.
Obviously, of these two excerpts, the former is easier to read compared to the latter.
2. Splitting the Data
Since our dataset is rather small, we will use cross-validation to get a more accurate measure of our model’s performance. As such, we will split the data into training and validation sets using stratified k-fold. With stratified k-fold, the folds are made by preserving the percentage of samples for each class. This method is useful when we have a skewed dataset, or in our case when the distribution of target is not balanced. However, because our target is a continuous variable instead of classes, we need some sort of workaround. This is where binning the target comes to the rescue. The bins are akin to classes, which would then be perfectly fine for scikit-learn’s StratifiedKFold to handle.
The code is rather straightforward. We randomize the rows of data and reset the row index before calculating the number of bins required to bin the target. One way to do this is to use Sturges' rule to determine the number of bins. Next, we use scikit-learn’s StratifiedKFold class to split the data into 5 folds based on the bins that we have. Finally, the generated fold number (ranging from 0 to 4) will be assigned to a new column called skfold. At the end of the process, the bins are no longer required and can be discarded.
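A minimal sketch of this splitting step could look like the following, assuming the data sits in a pandas dataframe with the target column described above and that the fold number is written to a column named skfold (the random seed is an arbitrary choice):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def create_folds(df, n_splits=5, seed=42):
    # Shuffle the rows and reset the row index
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)

    # Sturges' rule for the number of bins
    num_bins = int(np.floor(1 + np.log2(len(df))))

    # Bin the continuous target so StratifiedKFold can treat the bins as classes
    df["bins"] = pd.cut(df["target"], bins=num_bins, labels=False)

    # Assign a fold number (0 to 4) to each row
    df["skfold"] = -1
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fold, (_, valid_idx) in enumerate(skf.split(X=df, y=df["bins"])):
        df.loc[valid_idx, "skfold"] = fold

    # The bins are no longer needed after the split
    return df.drop(columns=["bins"])
```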
For your information, the full dataset has a mean target of -0.96 (rounded to 2 decimal places). After splitting into 5 folds, we can see that the distribution shape of target in each fold is preserved. Looking at the plot below, the mean target for each fold is nearly identical and indeed very close to -0.96.
3. Creating the Dataset Class
We will now create MyDataset, which subclasses torch.utils.data.Dataset. Excerpts will be passed in as texts, along with the tokenizer that will be used to tokenize them. In this process, the tokenizer produces the ids of the tokens (known as input ids) as well as the attention masks that are necessary to feed into our model. An example is illustrated in Figure 2 with the input text “Hello how are you?”. If you’re interested, more details about tokenizers, attention masks, padding, and truncation can be found here.
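A possible sketch of such a dataset class is shown below; the max_len default is an assumption, and the target is included only when provided so the same class can also be reused at inference time:

```python
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, texts, tokenizer, targets=None, max_len=256):
        self.texts = texts
        self.tokenizer = tokenizer
        self.targets = targets
        self.max_len = max_len  # assumed maximum sequence length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        # Tokenize one excerpt, padding/truncating to max_len
        encoded = self.tokenizer(
            self.texts[index],
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        item = {
            "input_ids": encoded["input_ids"].squeeze(0),            # token ids
            "attention_mask": encoded["attention_mask"].squeeze(0),  # attention mask
        }
        if self.targets is not None:
            item["target"] = torch.tensor(self.targets[index], dtype=torch.float)
        return item
```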
4. roberta-base As Our Model
RoBERTa stands for Robustly Optimized BERT Pre-training Approach, and it was presented by researchers from the University of Washington and Facebook in 2019. It is an improved pretraining procedure based on BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, which was released in 2018. We will use RoBERTa along with PyTorch throughout the demonstration, but you can also adapt and use other Transformer models if you want to. Be sure to check the relevant documentation of the Transformer model you use to confirm that it supports the inputs and outputs the code uses.
There are few variants of RoBERTa classes available at 🤗 Hugging Face. One of them is RobertaModel, referenced here as “the bare RoBERTa Model transformer outputting raw hidden-states without any specific head on top.” In other words, the raw output of bare RobertaModel is the hidden state vector of predefined hidden size corresponding to each token in the input sequence. Using the bare RobertaModel class, we will be adding our own custom regressor head for predicting the target.
For our Transformer fine-tuning task, we will use pretrained roberta-base from 🤗 Hugging Face as our model. As described there, “RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion”. roberta-base has a hidden size of 768 and is made up of one embedding layer followed by 12 hidden layers.
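As a quick illustration (not part of the training code itself), loading the pretrained roberta-base tokenizer and the bare model and checking its configuration might look like this:

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

MODEL_NAME = "roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
config = AutoConfig.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)  # bare RobertaModel, no head on top

print(config.hidden_size)        # 768
print(config.num_hidden_layers)  # 12
```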
5. What are the typical raw outputs from Transformers?
Before we go into creating and defining the model class, we need to understand what the Transformer raw outputs are. This is because we are going to be using the raw outputs to feed our custom regressor head.
Below are the common raw outputs that are usually returned by Transformer models such as BERT, ALBERT, and RoBERTa. They are taken from the documentation here, here, and here.
last_hidden_state: This is the sequence of hidden states at the output of the last layer of the model. It is a tensor of shape (batch_size, sequence_length, hidden_size).
pooler_output: This is the last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. It is a tensor of shape (batch_size, hidden_size). Note that pooler_output may not be available for certain Transformer models.
hidden_states: Optional, returned when output_hidden_states = True is passed. It is a tuple of tensors (one for the output of the embeddings plus one for the output of each layer), each of shape (batch_size, sequence_length, hidden_size).
So, what are batch_size, sequence_length, and hidden_size?
Usually, a model processes records in batches. Thus batch_size is the number of records that the model processes in one forward/backward pass before its internal parameters are updated. sequence_length is the value that we set for the tokenizer’s max_length parameter, while hidden_size is the number of features (or elements) in the hidden state. As for a tensor, you can visualize it as an n-dimensional array that can be used for arbitrary numeric computation.
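To make these shapes concrete, here is a small sketch that runs the example sentence through the bare roberta-base model and prints the shapes of the raw outputs (a batch of one record):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

inputs = tokenizer("Hello how are you?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
print(outputs.pooler_output.shape)      # (batch_size, hidden_size)
print(len(outputs.hidden_states))       # 13: embeddings output + 12 hidden layers
```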
6. Defining the Model Class
Here we will create MyModel, which subclasses nn.Module.
An nn.Module is the base class for all neural network modules, and it contains layers and a method forward that takes the input and returns the output. Other than that, it also contains states and parameters and can loop through them for weight updates or zero their gradients. The forward method is called from the __call__ function of nn.Module. Hence when we run MyModel(inputs), the forward method is called.
Using pooler_output
For any regression or classification task, the simplest implementation is to directly take the pooler_output and append just an additional regressor or classifier output layer.
Particularly in our case, we can define a regressor with one nn.Linear layer as part of our network in the __init__ method. Then, in the forward method, we feed pooler_output into the regressor to produce the prediction value for target.
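A minimal sketch of this simplest variant, assuming the inputs produced by MyDataset above, could be:

```python
import torch.nn as nn
from transformers import AutoModel

class MyModel(nn.Module):
    def __init__(self, model_name="roberta-base"):
        super().__init__()
        self.roberta = AutoModel.from_pretrained(model_name)
        # Single linear layer mapping the pooled first-token representation to one value
        self.regressor = nn.Linear(self.roberta.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        # pooler_output has shape (batch_size, hidden_size)
        return self.regressor(outputs.pooler_output)
```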
Building your own Transformer custom head
Other than simply taking the pooler_output, there are many different ways to define and compose your own layers into a custom head. One such example that we will demonstrate is the attention head, which is adapted from here.
🅰️ Attention head
In the forward method, the raw output from last_hidden_state is fed into an instance of another class, AttentionHead (we will talk about AttentionHead in the next paragraph). The output from AttentionHead is then passed into the regressor that we saw earlier.
Well, what’s in the AttentionHead then? There are two linear layers in the AttentionHead. It passes the last_hidden_state through the first linear layer and a tanh (hyperbolic tangent) activation function before the second linear layer, which derives the attention scores. The softmax function is then applied to these attention scores, re-scaling them so that the elements of the tensor lie in the range [0, 1] and sum to 1 (you can think of them as a probability distribution). These weights are then multiplied with last_hidden_state, and summing the tensor across the sequence_length dimension finally produces a result of shape (batch_size, hidden_size).
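Below is one possible sketch of this attention head variant; the dimensions of the two linear layers are assumptions (here both set to the model's hidden size):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class AttentionHead(nn.Module):
    def __init__(self, in_features, hidden_dim):
        super().__init__()
        self.W = nn.Linear(in_features, hidden_dim)  # first linear layer
        self.V = nn.Linear(hidden_dim, 1)            # second linear layer

    def forward(self, last_hidden_state):
        # Attention scores per token: (batch_size, sequence_length, 1)
        score = self.V(torch.tanh(self.W(last_hidden_state)))
        # Softmax over the sequence_length dimension
        attention_weights = torch.softmax(score, dim=1)
        # Weighted sum over tokens: (batch_size, hidden_size)
        context_vector = torch.sum(attention_weights * last_hidden_state, dim=1)
        return context_vector


class MyModel(nn.Module):
    def __init__(self, model_name="roberta-base"):
        super().__init__()
        self.roberta = AutoModel.from_pretrained(model_name)
        hidden_size = self.roberta.config.hidden_size
        self.attention_head = AttentionHead(hidden_size, hidden_size)
        self.regressor = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        context_vector = self.attention_head(outputs.last_hidden_state)
        return self.regressor(context_vector)
```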
🅱️ Concatenate hidden layers
Another technique that we would like to share is the concatenation of hidden layers. This idea comes from BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, where the authors mentioned that with a feature-based approach, concatenating the last four hidden layers gave the best performance in their case studies.
“The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer”
You can observe in the code below that we need to specify output_hidden_states = True when calling our model. This is because we now want to receive and use the outputs from the other hidden layers, not just the last_hidden_state.
In the forward method, the raw outputs from hidden_states are stacked, giving us a tensor of shape (layers, batch_size, sequence_length, hidden_size). Since roberta-base has 13 layers in total, this simply translates to a tensor of shape (13, batch_size, sequence_length, 768). Next, we concatenate the last four layers along the hidden_size dimension, which leaves us with a tensor of shape (batch_size, sequence_length, 768*4). After the concatenation, we use the representation of the first token of each sequence. We now have a tensor of shape (batch_size, 768*4), which is finally fed into the regressor.
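A sketch of this concatenation variant might look as follows; note that output_hidden_states=True is passed when calling the model:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MyModel(nn.Module):
    def __init__(self, model_name="roberta-base"):
        super().__init__()
        self.roberta = AutoModel.from_pretrained(model_name)
        hidden_size = self.roberta.config.hidden_size
        # The concatenated representation is 4 x hidden_size wide
        self.regressor = nn.Linear(hidden_size * 4, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        # Tuple of 13 tensors -> (13, batch_size, sequence_length, hidden_size)
        all_hidden_states = torch.stack(outputs.hidden_states)
        # Concatenate the last four layers along hidden_size:
        # (batch_size, sequence_length, hidden_size * 4)
        concat = torch.cat(
            (all_hidden_states[-1], all_hidden_states[-2],
             all_hidden_states[-3], all_hidden_states[-4]),
            dim=-1,
        )
        # Take the representation of the first token: (batch_size, hidden_size * 4)
        first_token = concat[:, 0, :]
        return self.regressor(first_token)
```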
If you’re interested in more examples, take a look at this notebook.
7. Model Training
Alright, let’s proceed to write the code for a basic model training process.
As we are not going to touch on advanced techniques for training Transformers in this post, we will just create simple functions. For now, we will need to create a loss function, a training function, a validation function, and finally the main function for running the training.
Since we’re using a pretrained model (and not training one from scratch), model training here is also commonly referred to as a Transformer fine-tuning process.
▶️ Evaluation Metric and Loss Function 📉
To measure the performance of our model, we will use RMSE (root mean squared error) as the evaluation metric.
Wait, what is the loss function then? What is it used for? Well, the loss function is meant to gauge the error between the prediction output and the provided target value in order to optimize our model. In fact, this is the function that the optimizer will try to minimize.
Sometimes evaluation metrics and loss functions can be different, especially for classification tasks. But in our case, since it’s a regression task, we will use RMSE for both.
As such, we will define our loss function as follows:
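A minimal way to write such an RMSE loss in PyTorch (one of several equivalent formulations) is:

```python
import torch
import torch.nn as nn

def loss_fn(predictions, targets):
    # RMSE: square root of the mean squared error
    return torch.sqrt(nn.MSELoss()(predictions, targets))
```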
▶️ Training Function
The train_fn that we are creating will train our model using the training data set. In the main training loop, this function will be called once for every epoch.
This function will first set the model in training mode. In essence, it will loop over all batches of training data in the data loader, get predictions for the batches, back-propagate the errors, update parameters based on current gradients and update the learning rate based on the scheduler.
An important note: we need to set the gradients to zero before starting back-propagation, because PyTorch accumulates the gradients on subsequent backward passes.
In the end, this function will return the training loss and learning rates that it has collected over the batches.
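A sketch of such a training function, assuming the batch fields produced by MyDataset above and a per-step learning rate scheduler, could be:

```python
def train_fn(data_loader, model, loss_fn, optimizer, scheduler, device):
    model.train()  # put the model in training mode
    train_losses, learning_rates = [], []

    for batch in data_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        targets = batch["target"].to(device)

        optimizer.zero_grad()  # reset gradients accumulated from the previous step
        predictions = model(input_ids, attention_mask).squeeze(-1)
        loss = loss_fn(predictions, targets)
        loss.backward()        # back-propagate the error
        optimizer.step()       # update parameters based on current gradients
        scheduler.step()       # update the learning rate

        train_losses.append(loss.item())
        learning_rates.append(scheduler.get_last_lr()[0])

    return train_losses, learning_rates
```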
▶️ Validation Function
The validate_fn is used to perform evaluation on our validation data set. It assesses how well our model is doing throughout the training process at each epoch. It is quite similar to the train_fn that we wrote above, except that gradient calculation is disabled. Hence there is no back-propagation of error and no update of parameters or learning rate.
This function will first set the model in evaluation mode. It will loop over all batches of validation data in the data loader, run predictions for the batches on validation data (i.e. data not seen during training), and collect the validation loss which will be returned at the end.
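And a matching sketch of the validation function, with gradient calculation disabled:

```python
import torch

def validate_fn(data_loader, model, loss_fn, device):
    model.eval()  # put the model in evaluation mode
    valid_losses = []

    with torch.no_grad():  # disable gradient calculation
        for batch in data_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            targets = batch["target"].to(device)

            predictions = model(input_ids, attention_mask).squeeze(-1)
            loss = loss_fn(predictions, targets)
            valid_losses.append(loss.item())

    return valid_losses
```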
Notes taken from PyTorch documentation here and here:
It is recommended that we always use model.train() when training and model.eval() when evaluating our model (validation/testing), because a module we are using might be updated to behave differently in training and evaluation modes.
Disabling gradient calculation is useful for inference (or validation/testing) when we are sure that we will not call .backward() for back-propagation. It will reduce memory consumption for computations that would otherwise require gradient computation.
▶️ Run Training
Now that we have created the train_fn and validate_fn, let’s proceed to create the main function for running our training.
The top part of this function will do the necessary preparations required for model training. For each fold, it will initialize the tokenizer, fetch and create training and validation data sets and data loaders, load model and send it to the device, and get the optimizer and learning rate scheduler.
Once all these are done, it is ready to go into the training loop. The training loop will call train_fn to do the training, followed by validate_fn to perform model evaluation, for each epoch. Generally, the training loss and validation loss should decrease gradually over the epochs. Whenever there is an improvement in validation loss (remember, the lower it is, the better), the model checkpoint is saved. Otherwise, the loop continues until the last epoch, or until the early stopping threshold is reached. Early stopping is triggered when there is no improvement in the validation loss for n consecutive epochs, where n is the preset threshold.
The function will also plot the training and validation loss, as well as the learning rate schedule at the end of each fold.
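Putting the pieces together, a condensed sketch of the main training function could look like the one below. It reuses MyDataset, MyModel, loss_fn, train_fn, and validate_fn from above; the hyperparameters, the linear warm-up scheduler, the checkpoint file names, and the omission of the loss/learning-rate plots are all simplifying assumptions:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, get_linear_schedule_with_warmup

def run_training(df, epochs=5, batch_size=16, lr=2e-5, patience=3):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    for fold in sorted(df["skfold"].unique()):
        # Prepare tokenizer, data sets, and data loaders for this fold
        tokenizer = AutoTokenizer.from_pretrained("roberta-base")
        train_df = df[df["skfold"] != fold].reset_index(drop=True)
        valid_df = df[df["skfold"] == fold].reset_index(drop=True)

        train_ds = MyDataset(train_df["excerpt"].tolist(), tokenizer, train_df["target"].tolist())
        valid_ds = MyDataset(valid_df["excerpt"].tolist(), tokenizer, valid_df["target"].tolist())
        train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
        valid_dl = DataLoader(valid_ds, batch_size=batch_size)

        # Load model, optimizer, and learning rate scheduler
        model = MyModel().to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        num_training_steps = len(train_dl) * epochs
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
        )

        best_loss, epochs_without_improvement = np.inf, 0
        for epoch in range(epochs):
            train_losses, _ = train_fn(train_dl, model, loss_fn, optimizer, scheduler, device)
            valid_losses = validate_fn(valid_dl, model, loss_fn, device)
            valid_loss = np.mean(valid_losses)

            if valid_loss < best_loss:
                best_loss = valid_loss
                epochs_without_improvement = 0
                torch.save(model.state_dict(), f"model_fold{fold}.pt")  # save checkpoint
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break  # early stopping
```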
Summary
Finally, we are reaching the end of this lengthy post. To summarize:
☑️ We learned how to perform a stratified k-fold split of the data into training and validation sets using scikit-learn’s StratifiedKFold. Particularly in our case, we made use of bins.
☑️ We got the gist of the typical raw outputs from Transformers.
☑️ We created and defined our dataset and model classes.
☑️ We explored some examples of custom regressor heads that we can build for our model.
☑️ We went through the fundamentals of the model training process and created the necessary functions for it.
That’s not all. Watch out for my next post on how to apply advanced training techniques to fine-tune Transformer models.