Summary

The web content describes an active learning experiment for an NLP classification problem using the Spooky Author Identification dataset, focusing on systematic data annotation and initial model training.

Abstract

The article explores the application of active learning to improve the performance of a natural language processing (NLP) classification model. The author uses the Spooky Author Identification dataset to simulate the annotation process of an initially unlabelled dataset. The process begins with data preparation and the creation of an oracle for manual annotation. The author then selects specific keywords associated with each author to guide the initial annotation phase, followed by the training of a baseline Naive-Bayes model. The effectiveness of the keyword-based heuristic is evaluated, revealing some imbalance in the resulting dataset, particularly for Mary Shelley's works. Despite the imbalance and a relatively simple model, the author achieves a 64.7% accuracy rate, identifying areas for improvement in subsequent annotation rounds, such as addressing the under-representation of Mary Shelley's texts and refining the model's discrimination between Edgar Allan Poe and Mary Shelley.

Opinions

The author assumes that manual annotators are error-free in labeling the dataset.
The initial keyword-based heuristic for selecting texts to annotate is acknowledged to be less effective for Mary Shelley's works.
The author prioritizes the analysis of the confusion matrix over optimizing the Log-Loss metric, indicating a focus on annotation quality rather than model performance.
The author recognizes the need for a more balanced dataset and suggests that future annotations should focus on under-represented classes to improve model accuracy.
The article suggests that the class imbalance might be an artefact of the labelling procedure or inherent to the dataset.
The author recommends annotating texts where the model shows confusion between Edgar Allan Poe and Mary Shelley to enhance the model's predictive capabilities.

An Active Learning experiment with an NLP classification problem

Introduction

In this article, I want to explore active learning for an NLP classification problem. Specifically, I use the Spooky Author Identification competition dataset and simulate labelling it, pretending to start from a completely unlabelled dataset.

The primary references for active learning are the MIT Introduction to Data-Centric AI course (in particular, the third class) and Human in the Loop Machine Learning.

I face the problem of manually annotating a dataset in many projects, and becoming more systematic about it is the goal of this article. For this article, I assume annotators make no errors in the data labelling; I think it is reasonable to assume that extracting text from books and annotating it with the author is not error-prone.

For this first article, I focus only on the first three steps of the annotation process: the data preparation/ingestion, the first annotation informed by domain knowledge, and the training of the first, baseline model.

Preparing the Dataset

As a first step, I read the data and create an oracle that returns the right label when queried; this simulates manual annotators reviewing the text.

import pandas as pd
from IPython.display import display
from IPython.display import Markdown

df = pd.read_csv("input/train.csv")
display(Markdown(df.head().to_markdown(index=False)))

Then, I create an unlabelled dataset and an annotate function, and test that the annotation works correctly:

unlabelled = df[["id", "text"]].copy()
truth = df[["id", "author"]].copy()


def annotate(truth: pd.DataFrame, tobelabelled: pd.DataFrame) -> pd.DataFrame:
    return tobelabelled.merge(truth, how="left", on="id")


display(Markdown(annotate(truth, unlabelled.iloc[:4, :]).to_markdown(index=False)))
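The oracle relies on a left merge, which can silently duplicate rows if an id is repeated in the truth table and leaves missing values if an id is unknown. The following is a minimal sketch of a stricter variant, assuming toy data; `annotate_checked`, `toy_truth` and `toy_unlabelled` are hypothetical names that are not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the competition data (made-up rows, for illustration only).
toy_truth = pd.DataFrame({"id": ["a", "b", "c"], "author": ["EAP", "HPL", "MWS"]})
toy_unlabelled = pd.DataFrame({"id": ["c", "a"], "text": ["third text", "first text"]})


def annotate_checked(truth: pd.DataFrame, tobelabelled: pd.DataFrame) -> pd.DataFrame:
    # validate="many_to_one" raises a MergeError if an id appears twice in truth,
    # which would otherwise silently duplicate rows in the result.
    out = tobelabelled.merge(truth, how="left", on="id", validate="many_to_one")
    # A left merge leaves NaN where the oracle has no answer; fail loudly instead.
    if out["author"].isna().any():
        raise ValueError("some ids are missing from the oracle")
    return out


toy_labelled = annotate_checked(toy_truth, toy_unlabelled)
print(toy_labelled)
```

The result keeps the row order of the texts sent for annotation, so it can be concatenated with earlier annotation rounds without re-sorting.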

How to start the annotation

I assume here that I know which authors are present in the dataset. Thus, as a first step, I search for specific keywords that I expect from a specific author:

hpl = ["cthulhu", "madness", "innsmouth", "arkham", "providence"]
mws = ["nongtongpaw", "paris", "john bull", "geneva", "valperga", "maurice"]
eap = ["gordon", "ismael", "julius", "kempelen", "mesmer"]

I count, among the unlabelled data, any occurrence of those words:

# A text that matches several keywords is counted once per matching keyword.
print(len([j for i in hpl for j in unlabelled["text"] if i in str(j).lower()]))
print(len([j for i in mws for j in unlabelled["text"] if i in str(j).lower()]))
print(len([j for i in eap for j in unlabelled["text"] if i in str(j).lower()]))

obtaining 220, 101 and 35, respectively. I sample 34, 33 and 33 among these occurrences (to reach 100 annotated examples):

df1 = unlabelled.iloc[
    [k for i in hpl for k, j in enumerate(unlabelled["text"]) if i in str(j).lower()], :
].sample(34, random_state=42)
display(Markdown(df1.head().to_markdown(index=False)))

df2 = unlabelled.iloc[
    [k for i in mws for k, j in enumerate(unlabelled["text"]) if i in str(j).lower()], :
].sample(33, random_state=42)
display(Markdown(df2.head().to_markdown(index=False)))

df3 = unlabelled.iloc[
    [k for i in eap for k, j in enumerate(unlabelled["text"]) if i in str(j).lower()], :
].sample(33, random_state=42)
display(Markdown(df3.head().to_markdown(index=False)))
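The list comprehensions above select a text once per matching keyword, so a text containing two keywords can appear twice among the candidates. The following is a minimal sketch of a variant that selects each text at most once, assuming toy data; `keyword_candidates` is a hypothetical helper, and applying it to the real data would in general give counts different from the 220, 101 and 35 reported above.

```python
import re

import pandas as pd


def keyword_candidates(unlabelled: pd.DataFrame, keywords: list[str]) -> pd.DataFrame:
    # One alternation pattern, so a text matching several keywords is selected once.
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = unlabelled["text"].astype(str).str.contains(pattern, case=False, regex=True)
    return unlabelled[mask]


# Made-up texts, for illustration only.
toy = pd.DataFrame(
    {
        "id": ["t1", "t2", "t3"],
        "text": [
            "Madness crept over Arkham.",  # matches two keywords
            "A quiet morning by the lake.",  # matches none
            "He returned to Providence.",  # matches one
        ],
    }
)
toy_hits = keyword_candidates(toy, ["madness", "arkham", "providence"])
print(toy_hits["id"].tolist())
```

Like the original code, this matches substrings rather than whole words, so a keyword such as "paris" would also match "comparison".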

Next, I validate whether the heuristic for selecting the cases to be labelled was effective:

import numpy as np


df1 = annotate(truth, df1)
df2 = annotate(truth, df2)
df3 = annotate(truth, df3)

print(np.mean(df1["author"] == "HPL"))
print(np.mean(df2["author"] == "MWS"))
print(np.mean(df3["author"] == "EAP"))

obtaining 0.824, 0.364 and 0.970, respectively. Thus, the keywords for Mary Shelley were not that effective. Nevertheless, I create the labelled dataset by simple concatenation:

labelled = pd.concat((df1, df2, df3))

The distribution of the labels is the following:

display(
    Markdown(labelled["author"].value_counts().reset_index().to_markdown(index=False))
)

The dataset is not very balanced. Whether this is caused by the keyword choice or by the different frequencies of the classes in the dataset, I cannot tell yet. I create a utility function to remove the labelled data from the unlabelled dataset:

def remove_labelled(labelled: pd.DataFrame, unlabelled: pd.DataFrame) -> pd.DataFrame:
    return unlabelled[~(unlabelled["id"].isin(labelled["id"]))].copy()


unlabelled = remove_labelled(labelled, unlabelled)

Then, I label a random sample of 50 texts, to check whether the class imbalance is a feature of the dataset or an artefact of the labelling procedure.

dfr = annotate(truth, unlabelled.sample(50, random_state=42))
# Remove the newly annotated random sample (`labelled` was already removed above).
unlabelled = remove_labelled(dfr, unlabelled)

In the new sample, the labels are distributed as follows:

The classes are still unbalanced, with frequencies not too different from the previous ones. This means that the search by keywords was not very effective.
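Comparing the two label distributions is easier with relative frequencies side by side. The following is a minimal sketch, assuming made-up labels that are not the article's results; `keyword_labels` and `random_labels` are hypothetical stand-ins for the author columns of the keyword-based and random samples.

```python
import pandas as pd

# Made-up labels, for illustration only; they are not the article's results.
keyword_labels = pd.Series(["EAP", "EAP", "HPL", "MWS", "EAP", "HPL"], name="author")
random_labels = pd.Series(["EAP", "HPL", "EAP", "MWS"], name="author")

# value_counts(normalize=True) returns class proportions; building a DataFrame
# from the two Series aligns them on the class label.
comparison = pd.DataFrame(
    {
        "keyword_sample": keyword_labels.value_counts(normalize=True),
        "random_sample": random_labels.value_counts(normalize=True),
    }
).fillna(0.0)
print(comparison.round(2))
```

With only 50 random texts the proportions are noisy, so small differences between the two columns should not be over-interpreted.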

Training a first model

As the goal is not to produce a submission for the competition, but rather to mimic the annotation of an unlabelled dataset, I choose a simple Naive-Bayes model for the baseline performance. With only 100 points to train the model, there is not much room to optimize hyper-parameters.

As I am more concerned with the annotation quality, I am not optimizing the Log-Loss, but focusing on the analysis of the confusion matrix.

import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


vectorizer = CountVectorizer().fit(labelled["text"])
X = vectorizer.transform(labelled["text"])

X_train, X_test, Y_train, Y_test = train_test_split(
    X.toarray(), labelled["author"], test_size=0.333, random_state=42
)
model = MultinomialNB().fit(X_train, Y_train)

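# Fix the label order explicitly; by default scikit-learn sorts labels
# alphabetically (EAP, HPL, MWS), which would mismatch the names below.
order = ["EAP", "MWS", "HPL"]
cm = pd.DataFrame(confusion_matrix(Y_test, model.predict(X_test), labels=order))
cm.columns = order
cm.index = order
display(Markdown(cm.to_markdown()))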

The confusion matrix is the following:

The accuracy in this case is 64.7%:

accuracy = round(
    100 * ((cm.iloc[0, 0] + cm.iloc[1, 1] + cm.iloc[2, 2]) / cm.sum().sum()), 1
)
print(f"{accuracy}%")

The model is not accurate, and too many authorships are assigned to Edgar Allan Poe. This is probably due to the over-representation of the class in the annotated dataset.
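When reading a confusion matrix, it helps to be explicit about which row and column belong to which class. When no `labels` argument is given, scikit-learn orders the classes by sorted label value. The following is a minimal sketch on made-up predictions (not the article's results), with rows as true authors and columns as predicted authors; `order`, `cm_toy` and `acc_toy` are hypothetical names.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up predictions, for illustration only.
y_true = ["EAP", "EAP", "HPL", "MWS", "MWS", "HPL"]
y_pred = ["EAP", "EAP", "HPL", "EAP", "MWS", "EAP"]

order = ["EAP", "MWS", "HPL"]  # explicit row/column order
cm_toy = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=order), index=order, columns=order
)
# accuracy_score is equivalent to the diagonal sum divided by the total count.
acc_toy = accuracy_score(y_true, y_pred)
print(cm_toy)
print(f"{100 * acc_toy:.1f}%")
```

In this toy matrix, the non-zero off-diagonal entries all sit in the EAP column, which is the pattern to look for when one class absorbs too many predictions.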

Conclusion

These were the first, preliminary steps in the process of annotating an unlabelled dataset. After this, new annotations would be performed taking into account (and balancing) random exploration of the unlabelled data and the limits of the current model.

As Mary Shelley texts are under-represented, I should annotate texts that my model labels as MWS to increase model accuracy in predicting MWS. It may also be helpful to annotate cases for which the model cannot discriminate well between EAP and MWS, to reduce the misclassification.
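One common way to find the cases a model cannot discriminate well is margin sampling: rank the unlabelled texts by the gap between the two highest predicted class probabilities and annotate the smallest gaps first. The following is a minimal sketch on made-up texts, assuming a Naive-Bayes model like the baseline; it is one possible selection rule, not necessarily the one used in later annotation rounds, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up texts, for illustration only.
train_texts = [
    "the raven croaked",
    "the monster wept",
    "the raven slept",
    "the monster fled",
]
train_labels = ["EAP", "MWS", "EAP", "MWS"]
pool_texts = [
    "the raven croaked again",
    "the raven and the monster",
    "the monster wept alone",
]

vec = CountVectorizer().fit(train_texts + pool_texts)
clf = MultinomialNB().fit(vec.transform(train_texts), train_labels)

proba = clf.predict_proba(vec.transform(pool_texts))
# Margin = gap between the two most probable classes; a small gap means confusion.
top_two = np.sort(proba, axis=1)[:, -2:]
margin = top_two[:, 1] - top_two[:, 0]
query_order = np.argsort(margin)  # most ambiguous pool texts first
print([pool_texts[i] for i in query_order])
```

To target a specific pair such as EAP and MWS, the same ranking can be restricted to the two corresponding columns of `proba`, whose order is given by `clf.classes_`.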
