Monitoring BERT Model Training with TensorBoard
Gradient Flow and Update Ratios

In the previous article, we explained the building blocks of the BERT model. Now we are going to train the model and monitor the training process in TensorBoard, looking at the gradient flow, update-to-parameter ratios, loss and evaluation metrics.
Why monitor gradient flow and update ratios instead of simply looking at the loss and evaluation metrics? When we train a model on a large amount of data, we might run many iterations before realising from the loss and evaluation metrics that the model is not training. Gradient magnitudes and update ratios, by contrast, let us spot almost immediately that something is wrong, which saves time and money.
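To make these two quantities concrete, here is a minimal sketch of how they could be computed for a single optimisation step. It assumes a tiny two-layer network as a stand-in for BERT, random data, and plain SGD; the helper name training_step_stats is hypothetical and is not taken from the article's code.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for BERT (an assumption for illustration only).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Random inputs and labels in place of a real batch.
inputs = torch.randn(32, 8)
targets = torch.randint(0, 4, (32,))


def training_step_stats(model, optimizer, inputs, targets):
    """Run one step; return the loss, per-parameter gradient norms and update ratios."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Snapshot the weights before the step so the applied update can be measured.
    before = {name: p.detach().clone() for name, p in model.named_parameters()}
    grad_norms = {name: p.grad.norm().item() for name, p in model.named_parameters()}

    optimizer.step()

    ratios = {}
    for name, p in model.named_parameters():
        update = p.detach() - before[name]
        # The small epsilon guards against division by zero.
        ratios[name] = (update.norm() / (before[name].norm() + 1e-12)).item()
    return loss.item(), grad_norms, ratios


loss, grad_norms, ratios = training_step_stats(model, optimizer, inputs, targets)
for name in grad_norms:
    print(f"{name}: grad norm {grad_norms[name]:.4f}, update ratio {ratios[name]:.2e}")
```

A gradient norm that stays near zero for some layer suggests that the layer is not receiving a learning signal, while an update ratio that is very large or very small hints at a learning rate that is too high or too low. A frequently quoted rule of thumb puts a healthy ratio around 1e-3, but this should be treated as a rough guide rather than a hard threshold.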
Data preparation
We will use the 20newsgroups dataset (License: Public Domain / Source: http://qwone.com/~jason/20Newsgroups/) from sklearn in this example, restricted to 4 categories: alt.atheism, talk.religion.misc, comp.graphics and sci.space. We tokenize the data with BertTokenizer from the transformers library and wrap the encodings in a BertDataset class, which inherits from torch.utils.data.Dataset. A DataLoader can then batch and shuffle the data and conveniently feed it to the model.
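The tokenizer call in this section uses truncation=True, padding=True and max_length. As a rough illustration of what these options do to a batch, the following plain-Python sketch truncates and pads lists of token ids and builds the matching attention mask. The helper pad_and_truncate is hypothetical and is not part of the transformers library; it slices naively, whereas the real tokenizer also keeps the closing [SEP] token (id 102) when it truncates.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each token-id list to max_length, then pad to the longest remaining one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    # Pad with pad_id on the right so every row has the same length.
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences; 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased.
batch = pad_and_truncate([[101, 2013, 1024, 102], [101, 2013, 102]], max_length=4)
print(batch["input_ids"])       # [[101, 2013, 1024, 102], [101, 2013, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The trailing zeros in input_ids and attention_mask in the batch printed in this section come from this kind of padding.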
import pandas as pd
import torch
from sklearn.datasets import fetch_20newsgroups
from torch.utils.data import Dataset, DataLoader
from transformers import BertConfig, BertTokenizer

# Assumed setup: the listing uses `device` and `tokenizer` without showing how they are created
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

X_train = pd.DataFrame(newsgroups_train['data'])
y_train = pd.Series(newsgroups_train['target'])
X_test = pd.DataFrame(newsgroups_test['data'])
y_test = pd.Series(newsgroups_test['target'])

BATCH_SIZE = 16
max_length = 256
config = BertConfig.from_pretrained("bert-base-uncased")
config.num_labels = len(y_train.unique())
config.max_position_embeddings = max_length

train_encodings = tokenizer(X_train[0].tolist(), truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(X_test[0].tolist(), truncation=True, padding=True, max_length=max_length)

class BertDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Build one example as a dict of tensors placed on the target device
        item = {key: torch.tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx]).to(device)
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = BertDataset(train_encodings, y_train)
test_dataset = BertDataset(test_encodings, y_test)

train_dataset_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dataset_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

# Inspect the first batch
for d in train_dataset_loader:
    print(d)
    break
# output :
{'input_ids': tensor([[ 101, 2013, 1024, ..., 0, 0, 0],
[ 101, 2013, 1024, ..., 1064, 1028, 102],
[ 101, 2013, 1024, ..., 0, 0, 0],
...,
[ 101, 2013, 1024, ..., 2620, 1011, 102],
[ 101, 2013, 1024, ..., 1012, 4012, 102],
[ 101, 2013, 1024, ..., 3849, 2053, 102]], device='cuda:0'),
'token_type_ids': tensor([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], device='cuda:0'),
'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:0'),
'labels': tensor([3, 0, 2, 1, 0, 2, 2, 1, 1, 0, 1, 3, 3, 0, 2, 1], device='cuda:0')}
TensorBoard usage
TensorBoard lets us log different types of data, including scalars and images, and save them for later analysis. First, let’s install tensorboard with pip:
pip install tensorboard
To write to TensorBoard, we will use the SummaryWriter class from torch.utils.tensorboard:
from torch.utils.tensorboard import SummaryWriter

# SummaryWriter takes the log directory as an argument
writer = SummaryWriter('tensorboard/runs/bert_experiment_1')
To write scalars, we use:
writer.add_scalar('loss/train', loss, counter_train)
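As a slightly fuller sketch of scalar logging, the snippet below writes a short series of values under the same tag. It assumes a temporary log directory and a list of made-up loss values (fake_losses is hypothetical) in place of a real training loop.

```python
import tempfile

from torch.utils.tensorboard import SummaryWriter

# A temporary directory is an assumption made so the sketch leaves no files in the project.
log_dir = tempfile.mkdtemp()
writer = SummaryWriter(log_dir)

# Made-up loss values standing in for what a training loop would compute.
fake_losses = [1.4, 1.1, 0.9, 0.8]

counter_train = 0
for loss in fake_losses:
    # Same tag on every call, so the values form one curve in the Scalars tab.
    writer.add_scalar('loss/train', loss, counter_train)
    counter_train += 1

# Flush pending events to disk and release the file handle.
writer.close()
```

The logged curve can then be viewed by running tensorboard --logdir followed by the log directory and opening the address it prints in a browser.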
The counter_train variable records the step number at which a value was written to TensorBoard, so that the values are plotted in order along the x-axis. To write an image, such as a matplotlib figure, we will use the following:
writer.add_figure("gradients", myfig, global_step=counter_train, close=True, walltime=None)
Model training
Now let’s look at our training function
