How to use XLNet from the Hugging Face Transformers library

How to use XLNet from the Hugging Face Transformers library for three important tasks

In this article, I will demonstrate how to use XLNet through the Hugging Face Transformers library for three important tasks. I will also show how you can configure XLNet for any task you want, beyond the standard tasks it was designed to solve.

Note that this article was written in April 2022, so earlier or later versions of the Hugging Face library may behave differently and the code in this article may not work with them.

A quick review of XLNet

XLNet is a generalized autoregressive model that uses permutation language modeling to build bidirectional, contextualized representations of words. It is notable for addressing weaknesses of the BERT transformer and for outperforming BERT on many tasks, such as question answering and sentiment analysis. Another major advantage is input length: BERT has a 512-token input limit, whereas XLNet uses relative positional encodings and has no fixed sequence length limit built into its architecture (in practice, available memory still bounds the input length). While BERT is a very powerful and versatile transformer, its architecture has two inherent weaknesses. First, because it uses masked language modeling to generate contextualized representations of words, it corrupts the input with mask tokens during pre-training; those tokens never appear in real text at fine-tuning time, so there is a mismatch between how the model is trained and how it is used. Second, when BERT masks more than one token in a sentence, it predicts each masked token independently and fails to capture dependencies between masked tokens that may carry important information about each other.

XLNet overcomes these two weaknesses by capturing bidirectional context around a word with permutation language modeling. Without masking any words or changing the input, permutation language modeling trains an autoregressive model over permutations of the order in which the tokens of a sentence are predicted (the factorization order), while the tokens themselves keep their original positions. It maximizes the expected log-likelihood over all possible permutations of that order, and therefore each token learns to use contextual information from all the other tokens in the sentence, creating rich word representations.
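To make the idea of factorization orders concrete, here is a small, dependency-free sketch. It is only an illustration of the principle, not XLNet's training code: it enumerates every order for a four-token sentence, whereas real training samples orders. The names tokens, contexts_for_order, and visible are my own.

```python
from itertools import permutations

tokens = ["The", "sky", "is", "clear"]

def contexts_for_order(order):
    """Map each target position to the positions visible when it is predicted.

    In an autoregressive model, a token may only condition on the tokens
    that come earlier in the chosen factorization order.
    """
    return {target: set(order[:step]) for step, target in enumerate(order)}

# For every position, collect the union of its contexts over all orders.
visible = {i: set() for i in range(len(tokens))}
for order in permutations(range(len(tokens))):
    for target, context in contexts_for_order(order).items():
        visible[target] |= context

for i, token in enumerate(tokens):
    others = [tokens[j] for j in sorted(visible[i])]
    print(f"{token!r} is predicted from: {others}")
```

Across all orders, every token ends up being predicted from every other token, both to its left and to its right, which is how the model obtains bidirectional context without a mask token.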

There are many tasks that XLNet can solve, but the ones I will go over in this article are multiple-choice question answering, extractive question answering, and language modeling. I will also demonstrate how to configure XLNet for tasks beyond these and beyond the ones Hugging Face provides models for.

Note that all the code and models I show in this article are taken directly from the Hugging Face Transformers library without any fine-tuning or training. XLNet, like many other versatile transformers, is adapted to a specific task by adding a linear layer on top of the core autoregressive model and fine-tuning it. Although Hugging Face does provide the pre-trained weights for the core model, it does not provide weights for the linear layer on top. To achieve the best performance on a specific task, this linear layer must be trained for that task, so the results of the code in this article will likely not be very good.

Multiple Choice Question Answering

Multiple-choice question answering is exactly what the name says: the model chooses among a few given answer options. The only reason I didn’t title this ‘Question Answering’ is the other version of question answering, extractive question answering, in which the model tries to find the answer within a context passage instead.

Before you run the code below, make sure to run these commands to install the required libraries.

pip install transformers 
pip install sentencepiece
pip install torch
## All of these lines can vary depending on what version of 
## each library you use
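Because the behaviour of this code depends on library versions, it can help to record which versions you actually have installed. The following is a minimal sketch using only the Python standard library (Python 3.8 or later); the helper name installed_versions is my own.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(["transformers", "sentencepiece", "torch"]).items():
    print(f"{name}: {ver or 'not installed'}")
```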

Here is my code for multiple-choice question answering:

from transformers import XLNetTokenizer, XLNetForMultipleChoice
from torch.nn import functional as F
import torch
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForMultipleChoice.from_pretrained("xlnet-base-cased", return_dict = True)
prompt = "What is the capital of France?"
answers = ["Paris", "London", "Lyon", "Berlin"]
## Pair the prompt with every answer option; padding makes all pairs the same length
encoding = tokenizer([prompt] * len(answers), answers, return_tensors="pt", padding = True)
## The model expects (batch_size, num_choices, sequence_length), so add a batch dimension
with torch.no_grad():
  outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()})
logits = outputs.logits
softmax = F.softmax(logits, dim = -1)
## .item() turns the one-element tensor into a plain Python int for list indexing
index = torch.argmax(softmax, dim = -1).item()
print("The predicted answer is", answers[index])

Hugging Face is set up such that each supported task has its own model class, and you have to import and load that specific class. In this case, we load XLNetForMultipleChoice, whereas the tokenizer is the same for all the different XLNet models.

We begin by encoding the question together with each of the 4 answer options. The way multiple choice works is quite straightforward: the model computes a score for how good each answer option is, and these raw scores are the logits, i.e. the output of the XLNet model before any softmax is applied. Applying a softmax to the logits turns them into a probability distribution over the answer options, where a higher probability means the model considers that option a better answer to the question. We then retrieve the index of the option with the highest probability using torch.argmax. If you are curious how the model rated each option, you can simply print out the tensor of softmax values. In my case, this is what it prints (remember that the linear layer on top of this model is not trained, so the values are not good).

tensor([[0.2661, 0.2346, 0.2468, 0.2525]])

In this case, the model happened to pick the correct answer, Paris. However, you can see that the softmax values are all close to 0.25, which is what a uniform guess over four options would give. Hugging Face provides the pre-trained base architecture that processes the question and answers, plus an untrained linear classifier on top that produces the output. Because that classifier is randomly initialized, your values will likely differ from run to run, and it is quite evident that the model needs to be trained to achieve good results.
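To see why nearly equal probabilities point to an untrained classifier, here is a small sketch of softmax followed by argmax in plain Python. Both sets of logits are made up for illustration and are not output from any model; they only contrast what close and well-separated scores look like after a softmax.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value (the plain-Python analogue of torch.argmax)."""
    return max(range(len(values)), key=values.__getitem__)

answers = ["Paris", "London", "Lyon", "Berlin"]
# Hypothetical logits, chosen only for illustration.
untrained_logits = [0.10, -0.03, 0.02, 0.05]
trained_logits = [6.0, 1.0, 2.5, 0.5]

for label, logits in [("untrained", untrained_logits), ("trained", trained_logits)]:
    probs = softmax(logits)
    print(label, [round(p, 4) for p in probs], "->", answers[argmax(probs)])
```

Both sets of logits select the same answer, but only the well-separated ones give it a probability far above the 0.25 of a uniform guess.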

Extractive Question Answering

Extractive question answering is the task of answering a question given some context text by outputting the start and end indexes of where the answer lies in the context. Here is my code for extractive question answering using XLNet:

from transformers import XLNetTokenizer
from transformers import XLNetForQuestionAnsweringSimple
from torch.nn import functional as F
import torch
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForQuestionAnsweringSimple.from_pretrained("xlnet-base-cased",return_dict = True)
question = "How many continents are there in the world?"
text = "There are 7 continents in the world."
## Encode the question and the context together as one input pair
inputs = tokenizer(question, text, return_tensors='pt')
with torch.no_grad():
  output = model(**inputs)
## Index of the most likely start token and end token of the answer
start_max = torch.argmax(F.softmax(output.start_logits, dim = -1))
end_max = torch.argmax(F.softmax(output.end_logits, dim=-1)) + 1
## add one because the end of a Python slice is exclusive
answer = tokenizer.decode(inputs["input_ids"][0][start_max : end_max])
print(answer)

As with multiple-choice question answering, we begin by downloading the specific XLNet model for question answering, and we tokenize our two inputs: the question and the context. Hugging Face provides two XLNet models for extractive question answering: XLNetForQuestionAnsweringSimple and XLNetForQuestionAnswering. You can learn more about both on the official Hugging Face Transformers documentation page.

The process for extractive question answering is slightly different from multiple choice: the model computes the best start and end indexes for where the answer is located in the context. For every token in the input, the model returns two scores, one representing how good that token would be as the start of the answer and one representing how good it would be as the end. We then compute the softmax of these scores to get probability distributions, retrieve the index of the highest value in both the start and end tensors using torch.argmax(), and decode and print the tokens in the input that fall within this start : end range.
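One caveat with taking the two argmax values independently is that the best end token can come before the best start token, which produces an empty answer. A common alternative, sketched here in plain Python, is to search for the pair with the highest combined score subject to start <= end. This is not what the code above does; the function name best_span, the max_answer_len parameter, and the scores are all hypothetical.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return the (start, end) pair, end inclusive, with the highest combined score.

    Only pairs with start <= end and at most max_answer_len tokens are considered.
    """
    best, best_score = None, float("-inf")
    for start, start_score in enumerate(start_scores):
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

# Hypothetical scores for a 6-token input, for illustration only.
start_scores = [0.1, 0.2, 0.1, 2.0, 0.3, 0.1]
end_scores = [0.1, 2.5, 0.2, 0.4, 1.8, 0.1]

independent = (start_scores.index(max(start_scores)), end_scores.index(max(end_scores)))
print("independent argmax:", independent)   # here the end index precedes the start index
print("constrained search:", best_span(start_scores, end_scores))
```

Since the returned end index is inclusive, the answer tokens would be sliced as tokens[start : end + 1].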

Language Modeling

Language Modeling is the task of predicting the best word to follow/continue a sentence given all the words already in the sentence.

from transformers import XLNetTokenizer, XLNetLMHeadModel
from torch.nn import functional as F
import torch
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-base-cased', return_dict = True)
text = "The sky is very clear at " + tokenizer.mask_token
## XLNet's tokenizer appends its special tokens (<sep>, <cls>) after the text,
## so disable them here to keep the mask token in the last position
inputs = tokenizer(text, add_special_tokens = False, return_tensors = "pt")
with torch.no_grad():
  output = model(**inputs).logits[:, -1, :]
softmax = F.softmax(output, dim = -1)
index = torch.argmax(softmax, dim = -1)
x = tokenizer.decode(index)
print(x)
new_sentence = text.replace(tokenizer.mask_token, x)
print(new_sentence)

We begin by downloading the specific XLNet model for language modeling, and we tokenize our input: the incomplete sentence, with the mask token concatenated to its end as I did above. The mask token has to be the last token of the encoded input, because the code reads the prediction from the last position; XLNet's tokenizer normally appends its special tokens after the text, so they must not be added here. The rest is relatively straightforward: we retrieve the logits of the model, take the logits at the last position using the -1 index (as this corresponds to the mask token at the end of the sentence), compute the softmax of these logits (here the softmax creates a probability distribution over all the tokens in XLNet’s vocabulary; tokens with a higher probability are better candidates to replace the mask token), find the token with the largest probability, and decode and print it. Note that the example in the Hugging Face documentation for this model additionally passes perm_mask and target_mapping arguments; the simpler call above is enough to demonstrate the idea.

In the code above, I am retrieving the word with the highest probability (i.e. the best candidate word), but if you are curious to know what the top 10 candidate words were (or any other number you like), you can use torch.topk() instead of torch.argmax(). It retrieves the top k values in a given tensor and returns both those values and their indices. After this, the process is the same as before: iterate through the indices, decode each candidate word, and replace the mask token in the sentence with it. Here is the code to do this:

from transformers import XLNetTokenizer, XLNetLMHeadModel
from torch.nn import functional as F
import torch
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-base-cased', return_dict = True)
text = "The sky is very clear at " + tokenizer.mask_token
## Tokenize first, then locate the mask token in the encoded input
inputs = tokenizer(text, return_tensors = "pt")
mask_index = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]
with torch.no_grad():
  output = model(**inputs).logits
softmax = F.softmax(output, dim = -1)
mask_word = softmax[0, mask_index, :]
## torch.topk returns (values, indices); [1] selects the indices
top_10 = torch.topk(mask_word, 10, dim = 1)[1][0]
for token in top_10:
  word = tokenizer.decode([token.item()])
  new_sentence = text.replace(tokenizer.mask_token, word)
  print(new_sentence)
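If the values-and-indices result of a top-k selection is unclear, the following plain-Python sketch mimics it on a tiny example. The five-word vocabulary and its probabilities are invented for illustration, and top_k is my own helper rather than a library function.

```python
import heapq

def top_k(probabilities, k):
    """Return (values, indices) of the k largest entries, largest first."""
    indices = heapq.nlargest(k, range(len(probabilities)), key=probabilities.__getitem__)
    values = [probabilities[i] for i in indices]
    return values, indices

# Hypothetical vocabulary and probabilities, for illustration only.
vocab = ["night", "dawn", "banana", "noon", "dusk"]
probs = [0.40, 0.25, 0.01, 0.20, 0.14]

values, indices = top_k(probs, 3)
for value, index in zip(values, indices):
    print(f"The sky is very clear at {vocab[index]} ({value:.2f})")
```

The indices are what gets decoded back into words; the values are only needed if you also want to show how confident the model was in each candidate.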

Using XLNet for any task you want

Although question answering, language modeling, and the other tasks that Hugging Face supports for XLNet are fundamentally important in NLP, people often want to use transformers like XLNet for other, more specialized tasks, especially in research. The way they do this is by taking the core XLNet model and attaching their own task-specific neural network to it (usually a linear layer). They then fine-tune this architecture on their own dataset for their own task. In PyTorch, it is best to set this up as a PyTorch module like this:

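from transformers import XLNetModel
import torch.nn as nn
class XLNet_Model(nn.Module):
  def __init__(self, classes):
    super().__init__()
    ## Pre-trained core model, without any task-specific head
    self.xlnet = XLNetModel.from_pretrained('xlnet-base-cased')
    ## Untrained linear layer mapping each hidden vector to `classes` scores
    self.out = nn.Linear(self.xlnet.config.hidden_size, classes)
  def forward(self, inputs):
    ## `inputs` is the dict returned by the tokenizer (input_ids, attention_mask, ...)
    outputs = self.xlnet(**inputs)
    ## last_hidden_state has shape (batch_size, sequence_length, hidden_size)
    out = self.out(outputs.last_hidden_state)
    return out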

Instead of downloading an XLNet model already designed for a specific task like question answering, I downloaded the base, pre-trained XLNet model and added a linear layer to it. The core model outputs one vector of size xlnet.config.hidden_size (768 for xlnet-base-cased) for every input token, so the linear layer takes hidden_size input features and outputs as many values as the number of classes you want.
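Because the linear layer is applied to every token, the model above returns one set of class scores per token. For a task that needs a single prediction per sequence, one option is to apply the layer only to one token's hidden vector. The plain-Python sketch below uses the last token, on the assumption that XLNet's tokenizer places its classification token at the end of the input. The hidden states, weights, and bias are toy values I made up, with a hidden size of 4 instead of 768.

```python
def linear(vector, weights, bias):
    """Apply y = W x + b to one vector (weights holds one row per output class)."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

# Toy hidden states: 1 sequence, 3 tokens, hidden size 4.
hidden_states = [[[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.1, 0.0, 0.2],
                  [0.3, 0.3, 0.1, 0.9]]]
# Toy classification head with 2 output classes.
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.5]

# Per-token scores: shape (batch, sequence_length, classes), as in the model above.
per_token = [[linear(tok, weights, bias) for tok in seq] for seq in hidden_states]
# Pooled scores: shape (batch, classes), using only the last token of each sequence.
pooled = [linear(seq[-1], weights, bias) for seq in hidden_states]

print(len(per_token), len(per_token[0]), len(per_token[0][0]))
print(pooled)
```

In the PyTorch model, the equivalent change would be to index the last position of last_hidden_state before passing it to the linear layer.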

I hope that you found this content easy to understand. If you think that I need to elaborate further or clarify anything, drop a comment below.

References

Hugging Face Transformer Library