Suleiman Khan, Ph.D.

Summary

Google BERT is an innovative pre-training method for natural language understanding that leverages bidirectional context and transformer architecture to excel in various NLP tasks.

Abstract

Google BERT (Bidirectional Encoder Representations from Transformers) is a groundbreaking approach in natural language processing (NLP) that significantly improves performance on a range of language tasks. It operates in two stages: unsupervised pre-training on a vast corpus of text to learn general language representations, followed by supervised fine-tuning on specific datasets to tailor the model for particular tasks. BERT's novelty lies in its ability to capture context from both sides of a word, unlike previous methods that were unidirectional or used shallow bidirectionality. This is achieved through the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks during pre-training. BERT's architecture is based on stacking multiple transformer encoders, utilizing the multi-head attention mechanism. The model was trained on a large dataset comprising Wikipedia articles and the BooksCorpus, and it has demonstrated state-of-the-art results on various NLP benchmarks, even surpassing human performance on some tasks. BERT is openly available for use and can be fine-tuned for a variety of language tasks, although pre-training requires significant computational resources.

Opinions

  • BERT represents a substantial advancement in NLP, outperforming previous state-of-the-art methods.
  • The use of MLM and NSP tasks during pre-training is considered a key factor in BERT's success.
  • BERT's bidirectional context learning is highlighted as a significant improvement over unidirectional or shallow bidirectional methods.
  • The open-source availability of BERT and its pre-trained models in TensorFlow and PyTorch is seen as a valuable resource for the AI community.
  • The computational demands of pre-training BERT are acknowledged, with recommendations to use TPUs or high-end GPUs like Nvidia V100 for efficient training.
  • The release of a multilingual BERT model, trained on Wikipedia data from 104 languages, is recognized as a step towards more inclusive NLP models, despite a slight trade-off in performance compared to single-language models.
  • There is a mention of critique regarding the potential bias introduced by BERT's MLM strategy, although the impact of this bias is not quantified in the provided context.

BERT Technology Introduced in 3 Minutes

Google BERT is a pre-training method for natural language understanding that, at its release, achieved state-of-the-art results on a range of NLP tasks.

BERT works in two steps. First, it uses a large amount of unlabeled data to learn a language representation in an unsupervised fashion; this step is called pre-training. Then, the pre-trained model can be fine-tuned in a supervised fashion, using a small amount of labeled training data, to perform various supervised tasks. Pre-training of machine learning models has already seen success in various domains, including image processing and natural language processing (NLP).

BERT stands for Bidirectional Encoder Representations from Transformers. It is based on the transformer architecture (released by Google in 2017). The general transformer uses an encoder and a decoder network. However, as BERT is a pre-training model, it uses only the encoder to learn a latent representation of the input text.


Technology

BERT stacks multiple transformer encoders on top of each other. The transformer is built around the well-known multi-head attention module, which has shown substantial success in both vision and language tasks.
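To make the attention module more concrete, here is a minimal pure-Python sketch of multi-head self-attention for a single sequence. It is an illustration under stated assumptions, not BERT's actual implementation: the tiny dimensions, the random weights, and the names `params`, `wq`, `wk`, `wv` and `wo` are all hypothetical, and layer normalization, residual connections and the feed-forward sublayer of a real encoder are left out.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(q, k, v):
    """Scaled dot-product attention for one head.

    q, k, v are lists of seq_len rows, each of length d_head. Every position
    attends to every position, both to its left and to its right.
    """
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head_self_attention(x, params, num_heads):
    """x is seq_len x d_model; params holds four d_model x d_model matrices."""
    d_model = len(x[0])
    d_head = d_model // num_heads
    q, k, v = (matmul(x, params[name]) for name in ("wq", "wk", "wv"))
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        out, _ = attention_head(
            [r[lo:hi] for r in q], [r[lo:hi] for r in k], [r[lo:hi] for r in v]
        )
        head_outputs.append(out)
    # Concatenate the heads position by position, then mix them with wo.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    return matmul(concat, params["wo"])

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2          # toy sizes, far smaller than BERT
x = random_matrix(seq_len, d_model, rng)       # stand-in for token embeddings
params = {n: random_matrix(d_model, d_model, rng) for n in ("wq", "wk", "wv", "wo")}
out = multi_head_self_attention(x, params, num_heads)
print(len(out), len(out[0]))
```

The output keeps the input shape (one vector per token), which is what allows several such encoder layers to be stacked.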

BERT’s state-of-the-art performance is based on two things. First, novel pre-training tasks called Masked Language Model (MLM) and Next Sentence Prediction (NSP). Second, a lot of data and compute power to train BERT.

MLM makes it possible to perform bidirectional learning from the text, i.e. it allows the model to learn the context of each word from the words appearing both before and after it. This was not possible earlier! Of the previous state-of-the-art methods, Generative Pre-Training (OpenAI GPT) used left-to-right training and ELMo used shallow bidirectionality.

The MLM pre-training task converts the text into tokens and uses the token representation as both the input and the output for training. A random subset of the tokens (15%) is masked, i.e. hidden during training, and the objective is to predict the original identities of the masked tokens. This is in contrast to traditional training methodologies, which either used unidirectional prediction as the objective or combined left-to-right and right-to-left training to approximate bidirectionality.

The NSP task allows BERT to learn relationships between sentences by predicting whether the second sentence in a pair is the true next sentence or not. For this, the training data is built so that 50% of the pairs are true consecutive sentences and 50% pair a sentence with a random one. BERT trains on the MLM and NSP objectives simultaneously.
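The two data-preparation steps described above can be sketched in a few lines of Python. This is a simplified illustration rather than the original preprocessing code: the helper names `mask_tokens` and `make_nsp_pairs` are hypothetical, whitespace splitting stands in for BERT's real tokenizer, every selected token is simply replaced by a `[MASK]` placeholder, and the 50/50 split is drawn per pair, so it only holds on average.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide a random subset of tokens and return (inputs, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so a loss would be computed only on the hidden tokens.
    """
    rng = rng or random.Random()
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return inputs, labels

def make_nsp_pairs(sentences, rng=None):
    """Build (sentence_a, sentence_b, is_next) examples from ordered sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to draw a random negative")
    rng = rng or random.Random()
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence except the current and the true next.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs

rng = random.Random(0)
tokens = "the man went to the store to buy a gallon of milk for his family".split()
inputs, labels = mask_tokens(tokens, rng=rng)
print(inputs)

sentences = ["I woke up.", "I made coffee.", "It was raining.", "I took the bus.", "Work was busy."]
for pair in make_nsp_pairs(sentences, rng=rng):
    print(pair)
```

During pre-training, each example would carry both kinds of labels at once: the masked positions for MLM and the `is_next` flag for NSP.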

Data and TPU/GPU Runtime

BERT was trained on 3.3 billion words in total: 2.5B from Wikipedia and 0.8B from BooksCorpus. The training was done on TPUs, while GPU estimates are shown below.

Training devices and times for BERT; used TPU and estimated for GPU.

Fine-tuning was done using 2.5K to 392K labeled samples. Importantly, datasets with more than 100K training samples showed robust performance across various hyper-parameters. Each fine-tuning experiment runs within 1 hour on a single cloud TPU and within a few hours on a GPU.

Results

BERT outperforms the previous state of the art on 11 NLP tasks by large margins. The tasks fall into three main categories: text classification, textual entailment, and question answering (Q/A). On two of the tasks, SQuAD and SWAG, BERT is the first to exceed human-level performance!

BERT results from the paper: https://arxiv.org/abs/1810.04805

Using BERT in your analysis

BERT is available as open source at https://github.com/google-research/bert, with pre-trained models covering 104 languages and implementations in TensorFlow and PyTorch.

It can be fine-tuned for several types of tasks, such as text classification, text similarity, question answering, and text labeling tasks such as part-of-speech tagging and named entity recognition. However, pre-training BERT can be computationally expensive unless you use TPUs or GPUs similar to the Nvidia V100.
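As a rough picture of what fine-tuning for text classification adds, the BERT paper places a single softmax layer on top of the final hidden vector of the special `[CLS]` token. The sketch below shows only that output layer and its loss, in pure Python. The random `cls_vector` is a stand-in for the encoder output, and the sizes and names are illustrative assumptions; in real fine-tuning the encoder weights would be updated together with this layer.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_vector, weights, bias):
    """Project the [CLS] representation to class probabilities.

    weights has one row per class, each row as long as cls_vector.
    """
    logits = [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[label])

rng = random.Random(0)
hidden_size, num_classes = 8, 3    # toy sizes; real BERT hidden sizes are much larger
cls_vector = [rng.uniform(-1, 1) for _ in range(hidden_size)]   # stand-in encoder output
weights = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)] for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = classify(cls_vector, weights, bias)
loss = cross_entropy(probs, label=1)
print(probs, loss)
```

Because only this small layer is new, a modest number of labeled examples can be enough to adapt the pre-trained model to a task.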

The BERT authors have also released a single multilingual model trained on the entire Wikipedia dump of 100 languages. Multilingual BERT has a few percent lower performance than models trained for a single language.

Critique

The BERT masking strategy in MLM biases the model towards the actual word. The impact of this bias on the training is not shown.
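For context, the BERT paper does not always replace a selected token with `[MASK]`. Of the tokens chosen for prediction, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged; the unchanged case is presumably the bias towards the actual word that this critique refers to. The sketch below simulates that rule on a toy vocabulary. The function name `corrupt_selected_token` and the vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"

def corrupt_selected_token(token, vocab, rng):
    """Apply the 80/10/10 rule to one token already chosen for prediction."""
    roll = rng.random()
    if roll < 0.8:
        return MASK                  # 80%: hide the token
    if roll < 0.9:
        return rng.choice(vocab)     # 10%: swap in a random vocabulary token
    return token                     # 10%: keep the actual word

rng = random.Random(0)
vocab = ["cat", "dog", "tree", "river", "book"]   # toy vocabulary
counts = {"mask": 0, "kept": 0, "other": 0}
for _ in range(10000):
    result = corrupt_selected_token("dog", vocab, rng)
    if result == MASK:
        counts["mask"] += 1
    elif result == "dog":
        counts["kept"] += 1          # unchanged, or randomly re-drawn as "dog"
    else:
        counts["other"] += 1
print(counts)
```

With such a small vocabulary the random replacement sometimes re-draws the original word, so the "kept" share comes out slightly above 10%; with a realistic vocabulary that effect is negligible.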

Update: Several new methods have since been proposed on top of BERT; this blog post discusses which one of them to use.

References

[1] https://cloud.google.com/tpu/docs/deciding-pod-versus-tpu

[2] Assuming second generation TPU, 3rd generation is 8 times faster. https://en.wikipedia.org/wiki/Tensor_processing_unit

[3] http://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/
