This context provides a comprehensive guide on building a Named Entity Recognition (NER) model using deep learning, specifically leveraging the BERT architecture, and demonstrates how to tag sentences with entities such as persons, locations, and organizations.
Abstract
The guide begins by introducing Named Entity Recognition (NER) as a task that benefits significantly from deep learning techniques. It explains that NER involves identifying and classifying named entities into predefined categories, using an example sentence to illustrate entities like 'Nick' as a 'Person' and 'Greece' as a 'Location'. The tutorial uses the Hugging Face Transformers library and the wnut_17 dataset to build and train the NER model. It covers data preparation, including loading and exploring the dataset, reorganizing training and validation data, and tokenization with special considerations for sub-word tokenization used by BERT. The preprocessing step aligns tokens and labels to handle the mismatch caused by tokenization. The guide also discusses fine-tuning the BERT model, establishing a baseline for comparison, and using metrics such as precision, recall, and F1-score for evaluation. It emphasizes the importance of not relying solely on accuracy due to dataset imbalance. Finally, it shows how to get predictions from the model and provides per-class metrics for the test set, concluding with a method for tagging custom sentences.
Opinions
The author suggests that deep learning, particularly BERT, is highly effective for NER tasks, surpassing simpler models like linear classifiers.
The tutorial advocates for the use of Hugging Face Transformers for NLP tasks, highlighting its user-friendly API.
The author emphasizes the importance of tokenization alignment to ensure correct labeling during model training.
The guide recommends using early stopping during training to prevent overfitting and to achieve a balance between training time and model performance.
The author encourages the use of seqeval for evaluating NER models, as it provides a more comprehensive assessment than accuracy alone, considering precision, recall, and F1-score.
The author provides a positive endorsement for using the wnut_17 dataset, which focuses on emerging and rare entities, as a benchmark for NER tasks.
The article concludes with a call to action, inviting readers to subscribe to the author's newsletter, follow on LinkedIn, and join Medium, indicating the author's desire to build a community and share knowledge further.
Named Entity Recognition with Deep Learning (BERT) — The Essential Guide
From data preparation to model training for NER tasks — and how to tag your own sentences
Nowadays, NLP has become synonymous with Deep Learning.
But Deep Learning is not the ‘magic bullet’ for every NLP task. For example, in sentence classification tasks, a simple linear classifier could work reasonably well, especially if you have a small training dataset.
However, some NLP tasks flourish with Deep Learning. One such task is Named Entity Recognition — NER:
NER is the process of identifying and classifying named entities into predefined entity categories.
For instance, in the sentence:
Nick lives in Greece and works as a Data Scientist.
We have 2 entities:
Nick, which is a ‘Person’.
Greece, which is a ‘Location’.
Therefore, given the above sentence, a classifier should be able to locate the two terms (‘Nick’, ‘Greece’) and correctly classify them as ‘Person’ and ‘Location’ respectively.
In this tutorial, we will build an NER model using Hugging Face Transformers.
Let’s dive in!
Load Data
We will use the wnut_17 [1] dataset, which is already included in the Hugging Face Datasets library.
Explore the dataset
This dataset focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. It contains 5690 documents, partitioned into training, validation, and test sets. The text sentences are tokenized into words. Let’s load the dataset:
from datasets import load_dataset
wnut = load_dataset("wnut_17")
We get the following:
Next, we print the ner_tags — the predefined entities of our model:
Each ner_tag describes an entity. It can be one of the following: corporation, creative-work, group, location, person, and product.
The letter that prefixes each ner_tag indicates the token position of the entity:
B- indicates the beginning of an entity.
I- indicates a token is contained inside the same entity (e.g., the “York” token is a part of the “New York” entity).
O indicates the token doesn’t correspond to any entity.
We also created the id2tag dictionary, which maps each label id to its ner_tag. This will come in handy later.
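The original listing is not reproduced here, so the following is only a minimal sketch of how such a mapping can be built. The label list is written out by hand and assumed to match the order of the dataset's ner_tags feature; with the datasets library it can usually be read from wnut["train"].features["ner_tags"].feature.names instead. The helper entity_type is a hypothetical name added for illustration.

```python
# Label names, assumed to be in the same order as the dataset's ner_tags feature.
label_names = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

# Map each integer label to its tag, and build the reverse lookup as well.
id2tag = dict(enumerate(label_names))
tag2id = {tag: idx for idx, tag in id2tag.items()}


def entity_type(tag):
    """Strip the B-/I- prefix; the 'O' tag has no entity type."""
    return None if tag == "O" else tag.split("-", 1)[1]


print(id2tag[1], "->", entity_type(id2tag[1]))
```

Keeping both directions around is convenient: id2tag turns model outputs into readable tags, while tag2id is handy when configuring the model's label set.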
Reorganize train & validation datasets
Our dataset is not that large. Remember, Transformers require lots of data to take advantage of their superior performance.
To solve this issue, we concatenate training and validation datasets into a single training dataset. The test dataset will remain as-is for validation purposes:
A training example
Let’s print the 3rd training example from our dataset. We will use that example as a reference throughout this tutorial:
The ‘Pxleyes’ token is classified as B-corporation (the beginning of a corporation). The rest of the tokens are irrelevant — they do not represent any entity.
Preprocessing
Next, we tokenize our data. Contrary to other use cases, tokenization for NER tasks requires special handling.
We will use the bert-base-uncased model and tokenizer from the Hugging Face library.
Transformer models mostly use sub-word-based tokenizers.
During tokenization, a word can be split into two or more sub-word tokens. This is standard practice because rare words can be decomposed into smaller, meaningful pieces. For example, BERT models use WordPiece tokenization by default.
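To make the idea of sub-word splitting concrete, here is a toy sketch of the greedy longest-match-first strategy that BERT's WordPiece tokenizer applies to each word. The vocabulary below is made up for illustration (the real one has roughly 30,000 entries), and the function name wordpiece is hypothetical, not part of any library.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercased word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, assumed for this example only.
toy_vocab = {"p", "##xley", "##es", "is", "hiring"}
print(wordpiece("pxleyes", toy_vocab))
print(wordpiece("hiring", toy_vocab))
```

Frequent words stay whole because they are in the vocabulary, while a rare word is broken into a first piece plus ##-prefixed continuation pieces.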
Let’s tokenize our sample training example to see how this works:
This is the original training example:
And this is how the training example is tokenized by BERT’s tokenizer:
Notice that there are two significant issues:
The special tokens [CLS] and [SEP] are added.
The token “Pxleyes” is split into 3 sub-tokens: p, ##xley and ##es.
In other words, the tokenization creates a mismatch between the inputs and the labels. Hence, we realign tokens and labels in the following way:
Each single word token is mapped to its corresponding ner_tag.
We assign the label -100 to the special tokens [CLS] and [SEP] so the loss function ignores them. PyTorch's cross-entropy loss ignores targets equal to -100 by default (its ignore_index).
For subwords, we only label the first token of a given word. Thus, we assign -100 to other subtokens from the same word.
For example, the token Pxleyes is labeled as 1 (B-corporation). It is tokenized as ['p', '##xley', '##es'] and after token alignment the labels should become [1, -100, -100].
We implement this functionality in the tokenize_and_align_labels() function:
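The article's own listing is not reproduced here. As a minimal sketch of the alignment rules above, the helper below (align_labels, a hypothetical name) works on a list of word indices like the one a fast Hugging Face tokenizer returns from word_ids(), where special tokens map to None. The sample inputs are hand-written assumptions, not real tokenizer output.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each sub-token a label: the word's label for its first sub-token,
    ignore_index for special tokens and for the remaining sub-tokens."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(ignore_index)           # [CLS], [SEP], padding
        elif word_idx != previous:
            aligned.append(word_labels[word_idx])  # first sub-token of a word
        else:
            aligned.append(ignore_index)           # later sub-tokens of that word
        previous = word_idx
    return aligned


# Assumed example: [CLS] p ##xley ##es is [SEP], for the words ["Pxleyes", "is"].
sample_word_ids = [None, 0, 0, 0, 1, None]
sample_word_labels = [1, 0]  # 1 = B-corporation, 0 = O
print(align_labels(sample_word_ids, sample_word_labels))
```

In a full preprocessing function, this step would run once per sentence after tokenizing with is_split_into_words=True, and the result would be stored under the labels key of the tokenized batch.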
And that’s it! Let’s call our custom tokenization function:
The table below shows exactly the tokenization output for our sample training example:
Fine-Tuning the Model
We are now ready to build our Deep Learning model.
We load the bert-base-uncased pretrained model and fine-tune it using our data.
But first, we should train a naive classifier to use as a baseline model.
Baseline Model
The most obvious choice for a baseline classifier is to tag every token with the most frequent label in the entire training dataset, which is the O tag:
The baseline classifier becomes less naive if we tag each token with the most frequent label of the sentence it belongs to:
Therefore, we use the second model as a baseline.
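The two baselines can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the article's original code: the label ids in toy_gold are made up, and the function names are hypothetical. Note that the per-sentence variant, as described above, looks at each sentence's own labels.

```python
from collections import Counter


def most_frequent(labels):
    """Return the most frequent label in a flat list of labels."""
    return Counter(labels).most_common(1)[0][0]


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted label equals the gold label."""
    pairs = [(g, p) for gs, ps in zip(gold, pred) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs)


# Made-up label ids for three short sentences (0 = O).
toy_gold = [[0, 0, 1, 0], [0, 9, 10, 10, 10], [0, 0, 0]]

# Baseline 1: one label for the whole dataset.
global_label = most_frequent([label for sent in toy_gold for label in sent])
global_pred = [[global_label] * len(sent) for sent in toy_gold]

# Baseline 2: one label per sentence.
sentence_pred = [[most_frequent(sent)] * len(sent) for sent in toy_gold]

global_acc = token_accuracy(toy_gold, global_pred)
sentence_acc = token_accuracy(toy_gold, sentence_pred)
print(round(global_acc, 3), round(sentence_acc, 3))
```

On this toy data the per-sentence baseline scores higher, which is why it makes the stricter reference point.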
BERT for Named Entity Recognition
The Data Collator batches training examples together while applying padding to make them all the same size. The collator pads not only the inputs but also the labels:
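In Transformers this job is typically done by DataCollatorForTokenClassification. The pure-Python sketch below only illustrates what such a collator does to a batch; pad_batch is a hypothetical helper, the token ids are made up, and the padding values (0 for inputs, -100 for labels) are assumptions that mirror BERT's usual defaults.

```python
def pad_batch(examples, pad_token_id=0, label_pad_id=-100):
    """Pad every example in the batch to the length of the longest one."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_tokens = len(ex["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # The mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Labels are padded with -100 so the loss skips the padded positions.
        batch["labels"].append(ex["labels"] + [label_pad_id] * n_pad)
    return batch


# Two made-up examples of different lengths.
toy_examples = [
    {"input_ids": [101, 11, 12, 13, 102], "labels": [-100, 1, -100, 0, -100]},
    {"input_ids": [101, 21, 102], "labels": [-100, 0, -100]},
]
toy_batch = pad_batch(toy_examples)
print(toy_batch["labels"][1])
```

Padding labels with -100 rather than 0 matters here, because 0 is a real label id (the O tag) and would otherwise be counted in the loss.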
Regarding evaluation, since our dataset is imbalanced, we can’t rely only on accuracy.
Therefore, we will also measure precision and recall. Here, we load the seqeval metric, which at the time of writing ships with the datasets library (newer releases provide it through the separate evaluate library). This metric is commonly used for POS (part-of-speech) tagging and NER tasks.
Let’s apply it to our reference training example and see how this works:
Note: Remember, the loss function ignores all tokens tagged with -100 during training. Our evaluation function should also take into account this information.
Hence, the compute_metrics function is defined a bit differently: we calculate precision, recall, f1-score, and accuracy by ignoring everything tagged with -100:
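The original function is not reproduced here. The sketch below shows the two ideas involved: dropping every position labeled -100, then scoring whole entities rather than single tokens. It is a simplified stand-in for seqeval, not its implementation (for instance, an I- tag without a preceding B- tag is ignored here), and all function names and sample ids are assumptions made for illustration.

```python
def strip_ignored(pred_ids, label_ids, id2tag, ignore=-100):
    """Drop positions labeled -100 and convert the remaining ids to tag strings."""
    preds, golds = [], []
    for p_row, l_row in zip(pred_ids, label_ids):
        kept = [(p, l) for p, l in zip(p_row, l_row) if l != ignore]
        preds.append([id2tag[p] for p, _ in kept])
        golds.append([id2tag[l] for _, l in kept])
    return preds, golds


def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO-tagged sentence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if etype is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_prf(preds, golds):
    """Micro-averaged entity-level precision, recall and F1."""
    tp = n_pred = n_gold = 0
    for p_tags, g_tags in zip(preds, golds):
        p_spans, g_spans = extract_entities(p_tags), extract_entities(g_tags)
        tp += len(p_spans & g_spans)
        n_pred += len(p_spans)
        n_gold += len(g_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Made-up ids: the model finds the corporation but truncates the location.
toy_id2tag = {0: "O", 1: "B-corporation", 7: "B-location", 8: "I-location"}
toy_labels = [[-100, 1, -100, -100, 0, -100], [-100, 7, 8, 0, -100]]
toy_preds = [[0, 1, 0, 0, 0, 0], [0, 7, 0, 0, 0]]

pred_tags, gold_tags = strip_ignored(toy_preds, toy_labels, toy_id2tag)
scores = entity_prf(pred_tags, gold_tags)
print(scores)
```

Scoring at the entity level is stricter than token accuracy: the truncated location counts as a miss even though its first token was tagged correctly.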
Finally, we instantiate the Trainer class to fine-tune our model. Notice the usage of the EarlyStopping callback:
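Transformers provides an EarlyStoppingCallback with an early_stopping_patience parameter, which is used together with load_best_model_at_end=True and a metric_for_best_model in the training arguments. The plain-Python sketch below only illustrates the patience logic behind such a callback; the function name and the validation scores are made up for the example.

```python
def stopping_step(scores, patience=2):
    """Return the index of the evaluation at which training would stop,
    or None if the metric keeps improving often enough."""
    best, bad_evals = float("-inf"), 0
    for step, score in enumerate(scores):
        if score > best:
            best, bad_evals = score, 0  # improvement: reset the counter
        else:
            bad_evals += 1              # no improvement at this evaluation
            if bad_evals >= patience:
                return step
    return None


# Hypothetical validation F1 after each evaluation round.
toy_f1 = [0.40, 0.45, 0.44, 0.45, 0.43]
print(stopping_step(toy_f1, patience=2))
```

A larger patience value tolerates more plateaus before stopping, trading extra training time for a chance at a later improvement.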
These are our training metrics:
The model achieves much better validation accuracy than the baseline model. We could probably reach a higher f1-score with a larger model, or by letting the model train for more epochs without the EarlyStopping callback.
Test Set Evaluation
We use the same methodology as before for our test set.
The seqeval metric also outputs the per-class metrics:
The location and person entities achieve the best scores, while group has the lowest score.
Get Predictions
Finally, we create a function that performs entity recognition on our own sentences:
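The article's function is not reproduced here. As a minimal sketch of its final step, the helper below turns token-level predictions back into one tag per word by keeping the prediction for each word's first sub-token, which mirrors how the labels were aligned during training. It assumes the model has already been run and the predicted ids extracted; tag_words and the sample values are hypothetical.

```python
def tag_words(words, word_ids, token_pred_ids, id2tag):
    """Return (word, tag) pairs using the prediction of each word's first sub-token."""
    tagged, seen = [], set()
    for word_idx, pred in zip(word_ids, token_pred_ids):
        if word_idx is None or word_idx in seen:
            continue  # skip special tokens and later sub-tokens of a word
        seen.add(word_idx)
        tagged.append((words[word_idx], id2tag[pred]))
    return tagged


# Assumed inputs: [CLS] p ##xley ##es is [SEP] with made-up predicted ids.
demo_words = ["Pxleyes", "is"]
demo_word_ids = [None, 0, 0, 0, 1, None]
demo_pred_ids = [0, 1, 2, 2, 0, 0]
demo_id2tag = {0: "O", 1: "B-corporation", 2: "I-corporation"}
print(tag_words(demo_words, demo_word_ids, demo_pred_ids, demo_id2tag))
```

In a complete pipeline, the word indices would come from the tokenizer and the predicted ids from taking the argmax over the model's logits for each token.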
Let’s try a few examples:
The model has successfully tagged the two countries! Take a look at the United States:
“United” was correctly tagged as B-location.
“States” was correctly tagged as I-location.
Again, Apple was correctly tagged as a corporation. Also, our model correctly identified and recognized the Apple products.
Closing Remarks
Named Entity Recognition is a fundamental NLP task that has numerous practical applications.
Even though the Hugging Face library has created a super-friendly API for this process, there are still a few points of confusion.
I hope this tutorial has shed some light on them. The source code of this article can be found here