Evaluating Multi-level Categorization: 8 Hierarchical Classification Error Metrics

Hierarchical or multi-level classification problems involve the classification of instances into multiple levels or layers of nested categories. Evaluating the performance of models in such scenarios requires metrics that can handle the complexity introduced by the hierarchical structure of the labels. Here are some error metrics suitable for hierarchical classification problems:

Hierarchical Classification Metrics:

1. Hierarchical Precision, Recall, and F1 Score:

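A common starting point is to compute standard precision, recall, and F1 score over a binary indicator encoding of the labels (one column per category in the hierarchy), using the micro, macro, or weighted averaging schemes provided by scikit-learn.
Advantage: Extend traditional precision, recall, and F1 score to label sets that span several levels of a hierarchy.
Consideration: Can be computed separately for each level of the hierarchy by selecting that level’s columns, providing insights into performance at different levels.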

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Example true labels and predicted labels for a hierarchical classification problem
# (one row per instance, one binary column per category)
true_labels = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
predicted_labels = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# Calculate micro, macro, and weighted precision, recall, and F1 score.
# The 2D indicator matrices are passed as-is so that each column is scored as its own label.
micro_precision = precision_score(true_labels, predicted_labels, average='micro')
macro_precision = precision_score(true_labels, predicted_labels, average='macro')
weighted_precision = precision_score(true_labels, predicted_labels, average='weighted')

micro_recall = recall_score(true_labels, predicted_labels, average='micro')
macro_recall = recall_score(true_labels, predicted_labels, average='macro')
weighted_recall = recall_score(true_labels, predicted_labels, average='weighted')

micro_f1 = f1_score(true_labels, predicted_labels, average='micro')
macro_f1 = f1_score(true_labels, predicted_labels, average='macro')
weighted_f1 = f1_score(true_labels, predicted_labels, average='weighted')

print("Micro Precision:", micro_precision)
print("Macro Precision:", macro_precision)
print("Weighted Precision:", weighted_precision)

print("Micro Recall:", micro_recall)
print("Macro Recall:", macro_recall)
print("Weighted Recall:", weighted_recall)

print("Micro F1 Score:", micro_f1)
print("Macro F1 Score:", macro_f1)
print("Weighted F1 Score:", weighted_f1)

We pass the true and predicted labels as 2D binary indicator matrices, so scikit-learn treats each column as a separate label, and then we calculate micro, macro, and weighted precision, recall, and F1 score using scikit-learn’s precision_score, recall_score, and f1_score functions. Adjust the average parameter based on your specific requirements.

Note: Do not flatten the matrices into one long 1D list before scoring. Doing so turns the task into a single binary problem over the classes 0 and 1, and the per-label breakdown that macro and weighted averaging rely on is lost. Additionally, these metrics are computed at the label level, not considering the hierarchy explicitly. If you have a specific hierarchy structure, you might need to create a custom evaluation metric that takes the hierarchy into account.
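
One way to take the hierarchy into account is to extend each true and predicted label set with all of its ancestors before counting overlaps, so that predicting a sibling of the correct category earns partial credit. The sketch below is a minimal illustration of that idea, not a library API: the PARENT mapping, the category names, and the helper functions are all hypothetical and chosen only for the example.

```python
# Hypothetical hierarchy as a child -> parent mapping (None marks a top-level node).
PARENT = {
    "animal": None,
    "vehicle": None,
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}

def with_ancestors(labels, parent=PARENT):
    """Return the label set extended with every ancestor of each label."""
    extended = set()
    for label in labels:
        node = label
        while node is not None:
            extended.add(node)
            node = parent[node]
    return extended

def hierarchical_prf(true_sets, pred_sets, parent=PARENT):
    """Micro-averaged hierarchical precision, recall, and F1 over all instances."""
    overlap = pred_total = true_total = 0
    for true_labels, pred_labels in zip(true_sets, pred_sets):
        t = with_ancestors(true_labels, parent)
        p = with_ancestors(pred_labels, parent)
        overlap += len(t & p)
        pred_total += len(p)
        true_total += len(t)
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / true_total if true_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Illustrative data: the first prediction is a sibling of the truth,
# the second is exact, and the third is in the wrong branch entirely.
true_sets = [{"dog"}, {"car"}, {"cat"}]
pred_sets = [{"cat"}, {"car"}, {"car"}]

h_precision, h_recall, h_f1 = hierarchical_prf(true_sets, pred_sets)
print("Hierarchical Precision:", h_precision)
print("Hierarchical Recall:", h_recall)
print("Hierarchical F1 Score:", h_f1)
```

With flat exact matching only one of the three predictions is correct, whereas here the sibling prediction shares the "animal" ancestor with the truth and contributes to the overlap.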

2. Subset Accuracy (Exact Match Ratio):

Subset accuracy, also known as the exact match ratio, is a metric used in multi-label classification problems, including hierarchical classification scenarios. It measures the percentage of instances where the predicted set of labels exactly matches the true set of labels.
Advantage: Simple to interpret; an instance counts as correct only if every one of its labels is predicted correctly.
Consideration: Very strict; a prediction that misses a single label scores the same as one that is entirely wrong, so values tend to be low when instances carry many labels.

import numpy as np
from sklearn.metrics import accuracy_score

# Example true labels and predicted labels for a hierarchical classification problem
# (one row per instance, one binary column per category)
true_labels = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
predicted_labels = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# Calculate subset accuracy: for multi-label indicator input, accuracy_score
# counts an instance as correct only if its entire row of labels matches
subset_accuracy = accuracy_score(true_labels, predicted_labels)

print("Subset Accuracy (Exact Match Ratio):", subset_accuracy)

The accuracy_score function from scikit-learn computes subset accuracy when it receives 2D binary indicator matrices: each row holds the true or predicted labels for one instance, and the score is the ratio of instances where the predicted row exactly matches the true row. In this example only the first instance is predicted exactly, so the result is 1/3.

Note: Do not convert each row to a Python set. set([1, 0, 0, 1]) is {0, 1}, which discards the information about which labels are active, and accuracy_score does not accept a list of sets. If your labels are stored as lists of label names instead, convert them to an indicator matrix first, for example with scikit-learn’s MultiLabelBinarizer.

3. Hamming Loss:

Hamming Loss is a metric used in multi-label classification problems, including hierarchical classification scenarios. It measures the average fraction of incorrect labels, considering each label independently. The lower the Hamming Loss, the better the model’s performance.
Advantage: Measures the average proportion of incorrect labels.
Consideration: Treats each label independently and does not consider label hierarchies.

from sklearn.metrics import hamming_loss

# Example true labels and predicted labels for a hierarchical classification problem
true_labels = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
predicted_labels = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]]

# Flatten the true and predicted labels
flat_true_labels = [label for sublist in true_labels for label in sublist]
flat_predicted_labels = [label for sublist in predicted_labels for label in sublist]

# Calculate Hamming Loss
loss = hamming_loss(flat_true_labels, flat_predicted_labels)

print("Hamming Loss:", loss)

The hamming_loss function from scikit-learn is used to calculate the Hamming Loss. The true labels and predicted labels are flattened to 1D arrays before passing them to the function. The Hamming Loss is then calculated based on the fraction of incorrectly predicted labels.

Note: If your label data has a different structure, you may need to adjust the flattening process accordingly. Additionally, Hamming Loss treats each label independently and does not consider the hierarchical structure explicitly. If you have a specific hierarchy structure, you might need to create custom evaluation metrics that take the hierarchy into account.
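
To make the definition concrete, the following sketch counts the mismatched label slots by hand with NumPy and compares the result with scikit-learn. The variable names are chosen for this illustration only.

```python
import numpy as np
from sklearn.metrics import hamming_loss

y_true = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# Count the label slots where prediction and truth disagree
mismatches = int(np.sum(y_true != y_pred))

# Hamming Loss = wrong label slots / total label slots (instances x labels)
manual_loss = mismatches / y_true.size

# scikit-learn accepts the 2D indicator matrices directly
sklearn_loss = hamming_loss(y_true, y_pred)

print("Mismatched label slots:", mismatches, "of", y_true.size)
print("Manual Hamming Loss:", manual_loss)
print("scikit-learn Hamming Loss:", sklearn_loss)
```

Two of the twelve label slots differ (the third label of the second and third instances), so both computations give 2/12.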

4. Macro-F1 Score and Micro-F1 Score:

Macro-F1 Score and Micro-F1 Score are metrics used in multi-label classification problems, including hierarchical classification scenarios. Both metrics are variants of the traditional F1 Score, adapted for multi-label scenarios. They provide different ways of aggregating F1 scores across multiple labels.
Micro-F1 Score: It considers all instances and labels equally and calculates a single F1 score based on the overall counts of true positives, false positives, and false negatives.
Macro-F1 Score: It calculates the F1 score for each label independently and then takes the average across all labels. Each label is treated equally, regardless of its frequency.
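Advantage: Together they show whether performance is driven by frequent labels (Micro-F1) or also holds up on rare ones (Macro-F1), which matters in hierarchies where deep categories often have few examples.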

import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Example true labels and predicted labels for a hierarchical classification problem
# (one row per instance, one binary column per category)
true_labels = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
predicted_labels = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# Calculate Micro-F1 Score (pools true positives, false positives, and false negatives over all labels)
micro_f1 = f1_score(true_labels, predicted_labels, average='micro')

# Calculate Macro-F1 Score (unweighted mean of the per-label F1 scores)
# With average='macro', precision_recall_fscore_support returns the averaged
# precision, recall, and F1 score; the support entry is None
precision, recall, macro_f1, _ = precision_recall_fscore_support(true_labels, predicted_labels, average='macro')

print("Micro-F1 Score:", micro_f1)
print("Macro-F1 Score:", macro_f1)
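
The difference between the two averages is easiest to see from the per-label counts. The sketch below recomputes both scores by hand from true positives, false positives, and false negatives; it assumes every label has at least one true or predicted positive, so no denominator is zero, and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# Per-label counts (one entry per column)
tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)

# Macro-F1: compute F1 per label, then take the unweighted mean
per_label_f1 = 2 * tp / (2 * tp + fp + fn)
manual_macro_f1 = per_label_f1.mean()

# Micro-F1: pool the counts over all labels, then compute a single F1
manual_micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

print("Per-label F1:", per_label_f1)
print("Manual Macro-F1:", manual_macro_f1)
print("Manual Micro-F1:", manual_micro_f1)
```

Here the third label is never predicted correctly, which pulls Macro-F1 down to 0.75 while Micro-F1 stays at 10/12, because that label contributes only two of the twelve pooled counts.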

5. Label Ranking Average Precision (LRAP):

Label Ranking Average Precision (LRAP) is a metric used in multi-label classification problems, including hierarchical classification scenarios. It evaluates the quality of a model’s predicted label rankings. For each true label of an instance, it takes the fraction of labels ranked at or above it that are also true, then averages these fractions over the instance’s true labels and over all instances. The best possible score is 1.
Advantage: Rewards models that rank true labels above false ones, without requiring a decision threshold.
Consideration: Operates on per-label scores rather than hard predictions, and does not use the hierarchy itself.

from sklearn.metrics import label_ranking_average_precision_score

# Example true labels and predicted labels for a hierarchical classification problem
true_labels = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
predicted_labels = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]]

# Calculate Label Ranking Average Precision (LRAP)
lrap = label_ranking_average_precision_score(true_labels, predicted_labels)

print("Label Ranking Average Precision (LRAP):", lrap)

The label_ranking_average_precision_score function from scikit-learn is used to calculate LRAP. The true labels must be a binary indicator matrix representing the presence or absence of each label. The second argument is meant to hold per-label scores, such as probabilities or decision values; the hard 0/1 predictions used above are accepted, but they produce many tied ranks, so pass the model’s scores when they are available.

Note: If your label data has a different structure, you may need to adjust the input data accordingly. Additionally, LRAP depends only on the order of the scores within each instance and does not explicitly consider the hierarchical structure. If you have a specific hierarchy structure, you might need to create custom evaluation metrics that take the hierarchy into account.
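
The ranking computation can be sketched by hand and checked against scikit-learn. The scores below are made up for illustration, and lrap_manual is a hypothetical helper written for this example; it follows the convention that an instance with no true labels, or with all labels true, scores 1.

```python
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

def lrap_manual(y_true, y_score):
    """Hand-rolled LRAP for binary indicator truth and real-valued scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    per_instance = []
    for truth, scores in zip(y_true, y_score):
        relevant = np.flatnonzero(truth)
        if relevant.size == 0 or relevant.size == truth.size:
            per_instance.append(1.0)  # degenerate instance: nothing to rank
            continue
        fractions = []
        for j in relevant:
            # Rank of label j: how many labels score at least as high
            rank = np.sum(scores >= scores[j])
            # How many of those labels are true
            true_at_or_above = np.sum(scores[relevant] >= scores[j])
            fractions.append(true_at_or_above / rank)
        per_instance.append(np.mean(fractions))
    return float(np.mean(per_instance))

y_true = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
# Illustrative per-label scores (higher means more confident)
y_score = np.array([
    [0.9, 0.2, 0.1, 0.8],
    [0.1, 0.7, 0.3, 0.4],
    [0.6, 0.8, 0.7, 0.1],
])

manual_lrap = lrap_manual(y_true, y_score)
sklearn_lrap = label_ranking_average_precision_score(y_true, y_score)

print("Manual LRAP:", manual_lrap)
print("scikit-learn LRAP:", sklearn_lrap)
```

The first instance ranks both true labels on top and scores 1. In each of the other two, one false label is ranked above a true label, giving 5/6 per instance and 8/9 overall.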

6. Precision at k and Recall at k (P@k, R@k):

Precision at K (P@K) and Recall at K (R@K) are metrics used to evaluate the performance of a model in top-K recommendations. These metrics are particularly relevant in scenarios where only the top-ranked predictions matter, such as recommendation systems or information retrieval tasks.
Advantage: Evaluates precision and recall at the top-k predicted labels.
Consideration: Useful for assessing performance in scenarios where only the most relevant labels matter.

# Example true labels and predicted scores for a recommendation system
true_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1]
predicted_scores = [0.8, 0.6, 0.7, 0.9, 0.4, 0.2, 0.5, 0.3, 0.6]

# Combine true labels and predicted scores for sorting
data = list(zip(true_labels, predicted_scores))

# Sort by predicted scores in descending order
sorted_data = sorted(data, key=lambda x: x[1], reverse=True)

# Set the value of K
k = 3

# Take the top-K predictions
top_k_predictions = [label for label, _ in sorted_data[:k]]

# Calculate Precision at K
precision_at_k = sum(top_k_predictions) / k

# Calculate Recall at K
num_true_positives_at_k = sum(top_k_predictions)
num_actual_positives = sum(true_labels)
recall_at_k = num_true_positives_at_k / num_actual_positives

print(f'Precision at {k}: {precision_at_k}')
print(f'Recall at {k}: {recall_at_k}')

The true_labels represent the actual binary labels (1 for relevant, 0 for not relevant), and predicted_scores represent the predicted scores or probabilities assigned by the model. The code sorts the predictions based on their scores in descending order and then calculates Precision at K and Recall at K for the top-K predictions.

Note: Precision at K and Recall at K are application-specific and depend on the nature of the task. Additionally, these metrics do not explicitly consider the hierarchical structure of labels. If you have a specific hierarchy structure, you might need to create custom evaluation metrics that take the hierarchy into account.
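
For multi-label data the same idea is usually applied per instance and then averaged. The sketch below assumes a binary indicator matrix of true labels and a matrix of made-up scores; precision_recall_at_k is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def precision_recall_at_k(y_true, y_score, k):
    """Mean per-instance Precision at K and Recall at K."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    # Column indices of the k highest-scoring labels in each row
    top_k = np.argsort(-y_score, axis=1, kind="stable")[:, :k]
    # Number of true labels among the top-k of each row
    hits = np.take_along_axis(y_true, top_k, axis=1).sum(axis=1)
    precision = hits / k
    n_relevant = y_true.sum(axis=1)
    # Instances without any true label get recall 0 instead of dividing by zero
    recall = np.divide(hits, n_relevant,
                       out=np.zeros(hits.shape, dtype=float),
                       where=n_relevant > 0)
    return float(precision.mean()), float(recall.mean())

y_true = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
# Illustrative per-label scores (higher means more confident)
y_score = np.array([
    [0.9, 0.2, 0.1, 0.8],
    [0.1, 0.7, 0.3, 0.4],
    [0.6, 0.8, 0.7, 0.1],
])

p_at_2, r_at_2 = precision_recall_at_k(y_true, y_score, k=2)
print("Mean Precision at 2:", p_at_2)
print("Mean Recall at 2:", r_at_2)
```

Averaging per instance keeps instances with many labels from dominating the score; pooling the hits over all instances instead would be an equally valid design choice with a different meaning.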

7. Example-Wise F1 Score:

Example-wise F1 Score is a metric used in multi-label classification problems, including hierarchical classification scenarios. It calculates the F1 score for each instance independently and then averages the F1 scores across all instances.
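Advantage: Gives partial credit for each instance, unlike subset accuracy, and weights every instance equally regardless of how many labels it has.
Consideration: Does not consider label hierarchies; a wrong label counts the same whether it is a sibling of the true label or lies in a different branch.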

from sklearn.metrics import f1_score

# Example true labels and predicted labels for a hierarchical classification problem
true_labels = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
predicted_labels = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]]

# Calculate Example-wise F1 Score
example_wise_f1_scores = []

for true_instance, predicted_instance in zip(true_labels, predicted_labels):
    f1_instance = f1_score(true_instance, predicted_instance, average='binary')
    example_wise_f1_scores.append(f1_instance)

# Calculate the average Example-wise F1 Score
average_example_wise_f1 = sum(example_wise_f1_scores) / len(example_wise_f1_scores)

print("Example-wise F1 Scores:", example_wise_f1_scores)
print("Average Example-wise F1 Score:", average_example_wise_f1)

The f1_score function from scikit-learn is used with average='binary' to calculate the F1 score for each instance independently. The calculated F1 scores are then averaged to obtain the Example-wise F1 Score.

Note: If your label data has a different structure, you may need to adjust the input data accordingly. Additionally, Example-wise F1 Score treats each instance independently and does not explicitly consider the hierarchical structure. If you have a specific hierarchy structure, you might need to create custom evaluation metrics that take the hierarchy into account.
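
The same average can also be obtained in a single call by passing the 2D indicator matrices to f1_score with average='samples'. The sketch below uses the same toy data; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# average='samples' computes F1 per row (instance) and averages over rows
samples_f1 = f1_score(y_true, y_pred, average="samples")

print("Example-wise F1 Score (average='samples'):", samples_f1)
```

The per-instance scores are 1, 2/3, and 4/5, so the average is roughly 0.822.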

8. Top-1 and Top-N Accuracy:

Top-1 and Top-N Accuracy are metrics commonly used in classification tasks, including hierarchical or multi-label classification scenarios. These metrics assess the model’s ability to correctly predict the most likely class (Top-1) or the top-N most likely classes.
Advantage: Measures the proportion of instances where the true label is in the top-1 or top-N predicted labels.
Consideration: Useful for evaluating the model’s ability to predict the most relevant labels.

import numpy as np

# Example true labels and predicted probabilities for a hierarchical classification problem
# (true_labels must be a NumPy array so it can be reshaped with np.newaxis below)
true_labels = np.array([2, 0, 1, 1, 0])
predicted_probs = [
    [0.1, 0.6, 0.3],  # Class 1 is the most likely
    [0.7, 0.2, 0.1],  # Class 0 is the most likely
    [0.2, 0.4, 0.4],  # Classes 1 and 2 tie; np.argmax returns the first, class 1
    [0.3, 0.3, 0.4],  # Class 2 is the most likely
    [0.5, 0.4, 0.1]   # Class 0 is the most likely
]

# Calculate Top-1 Accuracy
top_1_predictions = np.argmax(predicted_probs, axis=1)
top_1_accuracy = np.mean(top_1_predictions == true_labels)

# Calculate Top-N Accuracy (let's use Top-2 as an example)
top_n_predictions = np.argsort(predicted_probs, axis=1)[:, -2:]  # Select top 2 predictions
top_n_accuracy = np.mean(np.any(top_n_predictions == true_labels[:, np.newaxis], axis=1))

print("Top-1 Accuracy:", top_1_accuracy)
print("Top-2 Accuracy:", top_n_accuracy)

np.argmax is used to find the index of the maximum predicted probability for Top-1 Accuracy, and np.argsort is used to find the indices of the top-N predicted probabilities for Top-N Accuracy. The code then checks if the true label is among the top-N predictions and calculates the accuracy accordingly.

Note: The example assumes that class indices are used for labels, and the example uses Top-2 Accuracy. You can adjust the value of N based on your specific requirements. Additionally, these metrics do not explicitly consider the hierarchical structure of labels. If you have a specific hierarchy structure, you might need to create custom evaluation metrics that take the hierarchy into account.
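
Recent versions of scikit-learn (0.24 and later) also ship a top_k_accuracy_score function that performs the same check. The sketch below uses made-up probabilities without ties, so the result does not depend on how ties are broken.

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

y_true = np.array([2, 0, 1, 1, 0])
# Illustrative class probabilities, one row per instance and one column per class
y_score = np.array([
    [0.10, 0.60, 0.30],
    [0.70, 0.20, 0.10],
    [0.20, 0.50, 0.30],
    [0.35, 0.25, 0.40],
    [0.50, 0.40, 0.10],
])

# k=1 is ordinary accuracy; k=2 accepts the true class among the two highest scores
top_1 = top_k_accuracy_score(y_true, y_score, k=1)
top_2 = top_k_accuracy_score(y_true, y_score, k=2)

print("Top-1 Accuracy:", top_1)
print("Top-2 Accuracy:", top_2)
```

Three of the five instances have the true class ranked first, and four have it among the top two.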