btd

Summary

The webpage content outlines various hierarchical classification error metrics for evaluating the performance of models in complex, multi-level classification problems.

Abstract

The content discusses the nuances of hierarchical or multi-level classification, emphasizing the necessity for specialized error metrics to accommodate the nested categories involved in such classifications. It introduces eight hierarchical classification error metrics: Hierarchical Precision, Recall, and F1 Score; Subset Accuracy; Hamming Loss; Macro-F1 and Micro-F1 Score; Label Ranking Average Precision (LRAP); Precision and Recall at k; Example-Wise F1 Score; and Top-1 and Top-N Accuracy. Each metric is described in detail with its advantages and considerations, alongside code snippets using scikit-learn that illustrate their computation. The article highlights the importance of choosing appropriate metrics that can handle the complexity of label hierarchies and provides guidance on implementing these metrics in practical scenarios.

Opinions

  • The author suggests that hierarchical precision, recall, and F1 score are particularly useful for hierarchical classification problems as they extend traditional metrics and provide insights into performance at different levels.
  • Subset accuracy is praised for measuring the percentage of instances where the entire set of predicted labels matches the true set of labels exactly, but it is acknowledged to be sensitive to strict matching.
  • The article notes that Hamming Loss, while useful for measuring incorrect labels independently, does not inherently account for hierarchical relationships between labels.
  • Macro-F1 and Micro-F1 Scores are highlighted as valuable for multi-label classification, with the former treating all labels equally and the latter considering label frequencies.
  • The author conveys that Label Ranking Average Precision (LRAP) is a robust metric that evaluates the quality of a model’s label rankings and handles varying label sets.
  • Precision at k and Recall at k are considered application-specific and particularly relevant for top-K recommendations where only the highest-ranked predictions are of interest.
  • The Example-Wise F1 Score is presented as an F1 score computed per instance and averaged across instances; although it is introduced as an adaptation for hierarchical classification, the article notes that it does not explicitly take the hierarchical structure into account.
  • Finally, Top-1 and Top-N Accuracy metrics are seen as useful for evaluating the model’s ability to predict the most relevant labels, though they also do not inherently consider hierarchical label structures.

Evaluating Multi-level Categorization: 8 Hierarchical Classification Error Metrics


Hierarchical or multi-level classification problems involve the classification of instances into multiple levels or layers of nested categories. Evaluating the performance of models in such scenarios requires metrics that can handle the complexity introduced by the hierarchical structure of the labels. Here are some error metrics suitable for hierarchical classification problems:

Hierarchical Classification Metrics:

1. Hierarchical Precision, Recall, and F1 Score:

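  • Hierarchical precision, recall, and F1 score extend the flat metrics by giving partial credit to predictions that are close to the true label in the hierarchy, typically by comparing the true and predicted labels together with their ancestors. The snippet below computes the flat micro, macro, and weighted averages with scikit-learn as a baseline.
  • Advantage: Extend traditional precision, recall, and F1 score to hierarchical structures.
  • Consideration: Can also be reported separately for each level of the hierarchy, which shows at which depth the errors occur.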
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Example true labels and predicted labels in multi-label indicator format:
# one row per instance, one column per label
true_labels = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
predicted_labels = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# Calculate micro, macro, and weighted precision, recall, and F1 score.
# The 2D matrices are passed as-is so that the averages run over the labels.
micro_precision = precision_score(true_labels, predicted_labels, average='micro')
macro_precision = precision_score(true_labels, predicted_labels, average='macro')
weighted_precision = precision_score(true_labels, predicted_labels, average='weighted')

micro_recall = recall_score(true_labels, predicted_labels, average='micro')
macro_recall = recall_score(true_labels, predicted_labels, average='macro')
weighted_recall = recall_score(true_labels, predicted_labels, average='weighted')

micro_f1 = f1_score(true_labels, predicted_labels, average='micro')
macro_f1 = f1_score(true_labels, predicted_labels, average='macro')
weighted_f1 = f1_score(true_labels, predicted_labels, average='weighted')

print("Micro Precision:", micro_precision)
print("Macro Precision:", macro_precision)
print("Weighted Precision:", weighted_precision)

print("Micro Recall:", micro_recall)
print("Macro Recall:", macro_recall)
print("Weighted Recall:", weighted_recall)

print("Micro F1 Score:", micro_f1)
print("Macro F1 Score:", macro_f1)
print("Weighted F1 Score:", weighted_f1)

We pass the label matrices directly, with one row per instance and one column per label, so that scikit-learn treats the problem as multi-label and averages over the labels. Flattening the matrices into a single 1D list would instead produce one binary problem whose classes are 0 and 1, and the micro average would collapse to plain accuracy. We then calculate micro, macro, and weighted precision, recall, and F1 score using scikit-learn’s precision_score, recall_score, and f1_score functions. Adjust the average parameter based on your specific requirements.

Note: If your label data has a different structure, convert it to this indicator format first (for example with scikit-learn’s MultiLabelBinarizer). Additionally, these metrics are computed at the label level, not considering the hierarchy explicitly. If you have a specific hierarchy structure, you might need to create a custom evaluation metric that takes the hierarchy into account.
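
One common way to build such a custom metric is to extend each true and predicted label with all of its ancestors and then compute precision and recall on the extended sets. The sketch below illustrates that idea under stated assumptions: the PARENT dictionary, the class names, and the helper functions are hypothetical and not part of scikit-learn, and each instance is assumed to have a single true and a single predicted node.

```python
# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "animal": None,
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "eagle": "bird",
}

def with_ancestors(label):
    """Return the label together with all of its ancestors."""
    nodes = set()
    while label is not None:
        nodes.add(label)
        label = PARENT[label]
    return nodes

def hierarchical_prf(true_labels, predicted_labels):
    """Micro-averaged hierarchical precision, recall, and F1."""
    overlap = pred_total = true_total = 0
    for t, p in zip(true_labels, predicted_labels):
        t_aug, p_aug = with_ancestors(t), with_ancestors(p)
        overlap += len(t_aug & p_aug)   # nodes shared by both paths
        pred_total += len(p_aug)
        true_total += len(t_aug)
    hp = overlap / pred_total if pred_total else 0.0
    hr = overlap / true_total if true_total else 0.0
    hf1 = 2 * hp * hr / (hp + hr) if (hp + hr) else 0.0
    return hp, hr, hf1

y_true = ["dog", "cat", "eagle"]
y_pred = ["dog", "dog", "mammal"]

hp, hr, hf1 = hierarchical_prf(y_true, y_pred)
print("Hierarchical Precision:", hp)  # 6 shared nodes / 8 predicted nodes = 0.75
print("Hierarchical Recall:", hr)     # 6 shared nodes / 9 true nodes
print("Hierarchical F1 Score:", hf1)
```

Predicting "dog" for a "cat" still earns credit for the shared ancestors "mammal" and "animal", which a flat metric would score as a complete miss.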

2. Subset Accuracy (Exact Match Ratio):

  • Subset accuracy, also known as the exact match ratio, is a metric used in multi-label classification problems, including hierarchical classification scenarios. It measures the percentage of instances where the predicted set of labels exactly matches the true set of labels.
  • Advantage: Simple to interpret, since an instance counts as correct only when every one of its labels is predicted correctly.
  • Consideration: Very strict; a prediction with a single wrong label scores the same as one that is wrong on every label, so partial matches earn no credit.
import numpy as np
from sklearn.metrics import accuracy_score

# Example true labels and predicted labels in multi-label indicator format:
# one row per instance, one column per label
true_labels = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
predicted_labels = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# Calculate subset accuracy: for multi-label indicator input, accuracy_score
# counts an instance as correct only if every label in its row matches
subset_accuracy = accuracy_score(true_labels, predicted_labels)

print("Subset Accuracy (Exact Match Ratio):", subset_accuracy)

The accuracy_score function from scikit-learn is used with the label matrices directly. For multi-label indicator input it calculates the subset accuracy, which is the ratio of instances where the predicted set of labels exactly matches the true set of labels. In this example only the first instance matches exactly, so the result is 1/3.

Note: Do not convert each row to a Python set. set([1, 0, 0, 1]) collapses to {0, 1} and loses the information about which labels are present, and a list of sets is not a supported input for accuracy_score. If your labels are stored as lists of label names, convert them to the indicator format first.
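
To make the definition concrete, the exact match ratio can also be computed by hand with NumPy and compared with scikit-learn. This is a minimal sketch on the same example data; the variable names are only illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# An instance is an exact match only if every label in its row agrees.
exact_match = np.all(y_true == y_pred, axis=1)
manual_subset_accuracy = exact_match.mean()

# accuracy_score on 2D indicator matrices applies the same rule.
sklearn_subset_accuracy = accuracy_score(y_true, y_pred)

print("Exact matches per instance:", exact_match)
print("Manual subset accuracy:", manual_subset_accuracy)
print("scikit-learn subset accuracy:", sklearn_subset_accuracy)
```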

3. Hamming Loss:

  • Hamming Loss is a metric used in multi-label classification problems, including hierarchical classification scenarios. It measures the average fraction of incorrect labels, considering each label independently. The lower the Hamming Loss, the better the model’s performance.
  • Advantage: Measures the average proportion of incorrect labels.
  • Consideration: Treats each label independently and does not consider label hierarchies.
from sklearn.metrics import hamming_loss

# Example true labels and predicted labels for a hierarchical classification problem
true_labels = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
predicted_labels = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]]

# Flatten the true and predicted labels
flat_true_labels = [label for sublist in true_labels for label in sublist]
flat_predicted_labels = [label for sublist in predicted_labels for label in sublist]

# Calculate Hamming Loss
loss = hamming_loss(flat_true_labels, flat_predicted_labels)

print("Hamming Loss:", loss)

The hamming_loss function from scikit-learn is used to calculate the Hamming Loss. The true labels and predicted labels are flattened to 1D arrays before passing them to the function. The Hamming Loss is then calculated based on the fraction of incorrectly predicted labels.

Note: If your label data has a different structure, you may need to adjust the flattening process accordingly. Additionally, Hamming Loss treats each label independently and does not consider the hierarchical structure explicitly. If you have a specific hierarchy structure, you might need to create custom evaluation metrics that take the hierarchy into account.
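
As a rough illustration of how the hierarchy could be brought in, the sketch below first reproduces the Hamming Loss by hand and then reports it separately per hierarchy level. The mapping of label columns to levels (LEVEL_COLUMNS) is purely hypothetical; substitute the layout of your own label matrix.

```python
import numpy as np
from sklearn.metrics import hamming_loss

y_true = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# Hamming Loss is the fraction of label slots that disagree.
manual_loss = float(np.mean(y_true != y_pred))
sklearn_loss = hamming_loss(y_true, y_pred)

# Hypothetical assignment of label columns to hierarchy levels.
LEVEL_COLUMNS = {"level 1": [0, 1], "level 2": [2, 3]}
per_level_loss = {
    level: float(np.mean(y_true[:, cols] != y_pred[:, cols]))
    for level, cols in LEVEL_COLUMNS.items()
}

print("Manual Hamming Loss:", manual_loss)
print("scikit-learn Hamming Loss:", sklearn_loss)
print("Hamming Loss per level:", per_level_loss)
```

In this toy example all errors fall on the second level, which the single overall number hides.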

4. Macro-F1 Score and Micro-F1 Score:

  • Macro-F1 Score and Micro-F1 Score are metrics used in multi-label classification problems, including hierarchical classification scenarios. Both metrics are variants of the traditional F1 Score, adapted for multi-label scenarios. They provide different ways of aggregating F1 scores across multiple labels.
  • Micro-F1 Score: It pools the true positives, false positives, and false negatives over all instances and labels and calculates a single F1 score from those totals, so frequent labels carry more weight.
  • Macro-F1 Score: It calculates the F1 score for each label independently and then takes the average across all labels. Each label is treated equally, regardless of its frequency.
  • Consideration: Both variants treat the labels as a flat set and do not use the hierarchy; comparing the two shows whether rare labels are handled worse than frequent ones.
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Example true labels and predicted labels in multi-label indicator format:
# one row per instance, one column per label
true_labels = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
predicted_labels = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# Keep the labels as a 2D matrix so that the averages run over the labels;
# flattening would reduce the task to a single binary problem

# Calculate Micro-F1 Score
micro_f1 = f1_score(true_labels, predicted_labels, average='micro')

# Calculate Macro-F1 Score
# With average='macro', precision_recall_fscore_support returns the macro-averaged
# precision, recall, and F1 score; the fourth value (support) is None
precision, recall, macro_f1, _ = precision_recall_fscore_support(true_labels, predicted_labels, average='macro')

print("Micro-F1 Score:", micro_f1)
print("Macro-F1 Score:", macro_f1)
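
To see where the two averages come from, it can help to look at the per-label F1 scores first. The following sketch uses average=None on the same example data; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# average=None returns one F1 score per label (column).
per_label_f1 = f1_score(y_true, y_pred, average=None)

# Macro-F1 is the plain mean of the per-label scores.
macro_f1_check = per_label_f1.mean()

# Micro-F1 pools the counts over all labels before computing F1.
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
micro_f1_check = 2 * tp / (2 * tp + fp + fn)

print("Per-label F1:", per_label_f1)       # third label is never predicted correctly
print("Macro-F1 (mean of per-label):", macro_f1_check)
print("Micro-F1 (pooled counts):", micro_f1_check)
```

Here the single failing label pulls Macro-F1 down to 0.75, while Micro-F1 stays at 5/6 because that label contributes only a small share of the pooled counts.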

5. Label Ranking Average Precision (LRAP):

  • Label Ranking Average Precision (LRAP) is a metric used in multi-label classification problems, including hierarchical classification scenarios. It evaluates the quality of a model’s predicted label rankings. For each true label of an instance, it takes the fraction of labels ranked at or above it that are also true, then averages over the true labels and over all instances. The best possible value is 1.
  • Advantage: Measures the average precision of the true labels in the predicted ranking.
  • Consideration: Handles a varying number of true labels per instance, but ranks the labels as a flat set and does not use the hierarchy.
from sklearn.metrics import label_ranking_average_precision_score

# Example true labels and predicted labels for a hierarchical classification problem
true_labels = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
predicted_labels = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]]

# Calculate Label Ranking Average Precision (LRAP)
lrap = label_ranking_average_precision_score(true_labels, predicted_labels)

print("Label Ranking Average Precision (LRAP):", lrap)

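The label_ranking_average_precision_score function from scikit-learn is used to calculate LRAP. The true labels are binary-encoded lists representing the presence or absence of each label. The second argument is expected to hold one score per label, such as a probability or decision value, from which the ranking is derived. Binary predictions, as used here, are accepted but produce many tied ranks, so continuous scores give a more informative result.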

Note: If your label data has a different structure, you may need to adjust the input data accordingly. Additionally, LRAP treats each label independently and does not explicitly consider the hierarchical structure. If you have a specific hierarchy structure, you might need to create custom evaluation metrics that take the hierarchy into account.
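
Since LRAP is a ranking metric, it is usually fed continuous scores rather than hard 0/1 predictions. The small sketch below uses the example values from the scikit-learn documentation for this function and works through why the result is about 0.417.

```python
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]])

lrap = label_ranking_average_precision_score(y_true, y_score)

# Instance 1: the true label has the 2nd highest score -> 1/2
# Instance 2: the true label has the 3rd highest score -> 1/3
# LRAP = (1/2 + 1/3) / 2
print("Label Ranking Average Precision (LRAP):", lrap)
```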

6. Precision at k and Recall at k (P@k, R@k):

  • Precision at K (P@K) and Recall at K (R@K) are metrics used to evaluate the performance of a model in top-K recommendations. These metrics are particularly relevant in scenarios where only the top-ranked predictions matter, such as recommendation systems or information retrieval tasks.
  • Advantage: Evaluates precision and recall at the top-k predicted labels.
  • Consideration: Useful for assessing performance in scenarios where only the most relevant labels matter.
# Example true labels and predicted scores for a recommendation system
true_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1]
predicted_scores = [0.8, 0.6, 0.7, 0.9, 0.4, 0.2, 0.5, 0.3, 0.6]

# Combine true labels and predicted scores for sorting
data = list(zip(true_labels, predicted_scores))

# Sort by predicted scores in descending order
sorted_data = sorted(data, key=lambda x: x[1], reverse=True)

# Set the value of K
k = 3

# Take the top-K predictions
top_k_predictions = [label for label, _ in sorted_data[:k]]

# Calculate Precision at K
precision_at_k = sum(top_k_predictions) / k

# Calculate Recall at K
num_true_positives_at_k = sum(top_k_predictions)
num_actual_positives = sum(true_labels)
recall_at_k = num_true_positives_at_k / num_actual_positives

print(f'Precision at {k}: {precision_at_k}')
print(f'Recall at {k}: {recall_at_k}')

The true_labels represent the actual binary labels (1 for relevant, 0 for not relevant), and predicted_scores represent the predicted scores or probabilities assigned by the model. The code sorts the predictions based on their scores in descending order and then calculates Precision at K and Recall at K for the top-K predictions.

Note: Precision at K and Recall at K are application-specific and depend on the nature of the task. Additionally, these metrics do not explicitly consider the hierarchical structure of labels. If you have a specific hierarchy structure, you might need to create custom evaluation metrics that take the hierarchy into account.
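
For a multi-label setting with several instances, the same calculation can be wrapped in a function and averaged over instances. This is a minimal sketch; precision_recall_at_k is a hypothetical helper rather than a scikit-learn function, and the score matrix is made-up illustration data without ties.

```python
import numpy as np

def precision_recall_at_k(y_true, y_score, k):
    """Mean Precision@k and Recall@k over instances (rows)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Column indices of the k highest-scoring labels in each row.
    top_k = np.argsort(-y_score, axis=1)[:, :k]
    # Number of relevant labels among the top k of each row.
    hits = np.take_along_axis(y_true, top_k, axis=1).sum(axis=1)
    n_relevant = y_true.sum(axis=1)
    precision = hits / k
    # Rows without any relevant label get a recall of 0.
    recall = np.divide(hits, n_relevant,
                       out=np.zeros(len(hits)), where=n_relevant > 0)
    return precision.mean(), recall.mean()

y_true = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
y_score = [[0.9, 0.2, 0.1, 0.6],
           [0.3, 0.8, 0.4, 0.1],
           [0.7, 0.5, 0.6, 0.2]]

p_at_2, r_at_2 = precision_recall_at_k(y_true, y_score, k=2)
print("Mean Precision at 2:", p_at_2)
print("Mean Recall at 2:", r_at_2)
```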

7. Example-Wise F1 Score:

  • Example-wise F1 Score is a metric used in multi-label classification problems, including hierarchical classification scenarios. It calculates the F1 score for each instance independently and then averages the F1 scores across all instances.
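  • Advantage: Gives every instance equal weight, so the score reflects how well a typical instance is labelled regardless of how many labels it has.
  • Consideration: Labels are still compared as a flat set; the metric does not use the hierarchy unless the label sets are first extended with their ancestors.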
from sklearn.metrics import f1_score

# Example true labels and predicted labels for a hierarchical classification problem
true_labels = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
predicted_labels = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]]

# Calculate Example-wise F1 Score
example_wise_f1_scores = []

for true_instance, predicted_instance in zip(true_labels, predicted_labels):
    f1_instance = f1_score(true_instance, predicted_instance, average='binary')
    example_wise_f1_scores.append(f1_instance)

# Calculate the average Example-wise F1 Score
average_example_wise_f1 = sum(example_wise_f1_scores) / len(example_wise_f1_scores)

print("Example-wise F1 Scores:", example_wise_f1_scores)
print("Average Example-wise F1 Score:", average_example_wise_f1)

The f1_score function from scikit-learn is used with average='binary' to calculate the F1 score for each instance independently. The calculated F1 scores are then averaged to obtain the Example-wise F1 Score.

Note: If your label data has a different structure, you may need to adjust the input data accordingly. Additionally, Example-wise F1 Score treats each instance independently and does not explicitly consider the hierarchical structure. If you have a specific hierarchy structure, you might need to create custom evaluation metrics that take the hierarchy into account.
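
The explicit loop can also be replaced by scikit-learn’s built-in samples averaging, which computes the F1 score per instance and averages the results. A short sketch on the same example data:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0]])

# average='samples' computes F1 per instance (row) and then averages.
samples_f1 = f1_score(y_true, y_pred, average='samples')

# Per-instance F1 scores are 1, 2/3, and 4/5, so the mean is 37/45.
print("Example-wise F1 Score:", samples_f1)
```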

8. Top-1 and Top-N Accuracy:

  • Top-1 and Top-N Accuracy are metrics commonly used in classification tasks, including hierarchical or multi-label classification scenarios. These metrics assess the model’s ability to correctly predict the most likely class (Top-1) or the top-N most likely classes.
  • Advantage: Measures the proportion of instances where the true label is in the top-1 or top-N predicted labels.
  • Consideration: Useful for evaluating the model’s ability to predict the most relevant labels.
import numpy as np

# Example true labels and predicted probabilities for a hierarchical classification problem
# (NumPy arrays are required for the element-wise comparison and indexing below)
true_labels = np.array([2, 0, 1, 1, 0])
predicted_probs = np.array([
    [0.1, 0.6, 0.3],  # Class 1 is the most likely
    [0.7, 0.2, 0.1],  # Class 0 is the most likely
    [0.2, 0.4, 0.4],  # Classes 1 and 2 tie; np.argmax returns the first (class 1)
    [0.3, 0.3, 0.4],  # Class 2 is the most likely
    [0.5, 0.4, 0.1]   # Class 0 is the most likely
])

# Calculate Top-1 Accuracy
top_1_predictions = np.argmax(predicted_probs, axis=1)
top_1_accuracy = np.mean(top_1_predictions == true_labels)

# Calculate Top-N Accuracy (let's use Top-2 as an example)
top_n_predictions = np.argsort(predicted_probs, axis=1)[:, -2:]  # Select top 2 predictions
top_n_accuracy = np.mean(np.any(top_n_predictions == true_labels[:, np.newaxis], axis=1))

print("Top-1 Accuracy:", top_1_accuracy)
print("Top-2 Accuracy:", top_n_accuracy)

np.argmax is used to find the index of the maximum predicted probability for Top-1 Accuracy, and np.argsort is used to find the indices of the top-N predicted probabilities for Top-N Accuracy. The code then checks if the true label is among the top-N predictions and calculates the accuracy accordingly.

Note: The example assumes that class indices are used for labels, and the example uses Top-2 Accuracy. You can adjust the value of N based on your specific requirements. Additionally, these metrics do not explicitly consider the hierarchical structure of labels. If you have a specific hierarchy structure, you might need to create custom evaluation metrics that take the hierarchy into account.
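
Recent scikit-learn versions (0.24 and later) also provide top_k_accuracy_score, which avoids the manual argsort. The sketch below uses made-up scores without ties, because tied scores may be resolved differently than with np.argmax.

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

y_true = np.array([2, 0, 1, 1, 0])
y_score = np.array([
    [0.1, 0.6, 0.3],  # top-1: class 1, top-2: classes 1 and 2
    [0.7, 0.2, 0.1],  # top-1: class 0
    [0.2, 0.5, 0.3],  # top-1: class 1
    [0.3, 0.2, 0.5],  # top-1: class 2, top-2: classes 2 and 0
    [0.5, 0.4, 0.1],  # top-1: class 0
])

top_1 = top_k_accuracy_score(y_true, y_score, k=1)
top_2 = top_k_accuracy_score(y_true, y_score, k=2)

print("Top-1 Accuracy:", top_1)  # 3 of 5 instances
print("Top-2 Accuracy:", top_2)  # 4 of 5 instances
```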
