Viviana Márquez

Summary

CyNER is a Python library for cybersecurity named entity recognition that combines transformer-based models, heuristics, and general NLP models, offering flexibility in entity extraction strategies and requiring user-defined training for optimal results.

Abstract

The CyNER library, introduced earlier this year, is designed to facilitate cybersecurity named entity recognition by integrating three distinct strategies: transformer-based models for context-aware entity extraction, heuristics for pattern-based extraction of indicators of compromise (IOCs), and established NLP models like spaCy and Flair for general entity recognition. Users can customize the model's strategy and merging order of outputs. The library's GitHub repository provides a Jupyter Notebook demo, but users must train the model using a custom dataset or the provided dataset annotated from the MITRE database to achieve the demonstrated results. The process involves installing the library via pip, setting up a configuration for training with specified parameters, and then running the model to extract entities from text. The article also addresses a common error encountered during model execution and provides a solution by modifying the installed library's code.

Opinions

  • The author finds the provided Jupyter Notebook demo insufficient without prior model training and fine-tuning, prompting them to elaborate on the necessary steps for effective use of CyNER.
  • The author appreciates the flexibility of CyNER, highlighting the ability to define the use of different entity extraction strategies and the order in which their outputs are merged.
  • The author acknowledges the utility of the dataset shared by the CyNER authors for training purposes but also encourages the creation of custom datasets following the BIO format.
  • The author encountered an error related to missing attributes in the 'DataParallel' object, suggesting that some dependencies may have changed since CyNER's release, and provides a workaround by directly modifying the library's code.
  • The author expresses gratitude to the GitHub user 'tilusnet' for insights on getting started with CyNER, indicating a collaborative spirit within the cybersecurity community.

How to use CyNER: A Python Library for Cybersecurity Named Entity Recognition

Earlier this year, CyNER, an open-source Python library for Cybersecurity Named Entity Recognition, was released. Here are the respective links for the paper and the GitHub repository. In this post, I will give you a short tutorial on how to get started with it.

Introduction

Before jumping into the code, let’s quickly go over the logic behind CyNER. The library combines the following three strategies:

  • Transformer-based models: To extract cybersecurity-related entities using their context
  • Heuristics: To extract IOCs (Indicators of Compromise) that follow a specific pattern and can be matched with regular expressions (for example, IP addresses, CVEs, etc.)
  • spaCy and Flair: To extract general entities that do not fall under cybersecurity but might be of interest (for example, company names, countries, etc.)
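To make the heuristics strategy concrete, here is a minimal sketch of pattern-based IOC extraction with Python's re module. The patterns and the extract_iocs helper are my own illustrative assumptions, not the rules that CyNER ships with, and they are deliberately simplified.

```python
import re

# Illustrative patterns only; these are NOT CyNER's own heuristics.
IOC_PATTERNS = {
    "IPv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "CVE": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "SHA256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def extract_iocs(text):
    """Return (label, matched_text, start, end) tuples sorted by position."""
    found = []
    for label, pattern in IOC_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])


sample = "The dropper at 192.168.10.5 exploits CVE-2021-44228."
for ioc in extract_iocs(sample):
    print(ioc)
```

Because these entities have a fixed shape, a regular expression can find them without any training, which is why they are handled separately from the context-based model.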

CyNER is flexible: it lets you choose which strategies to use and the order of priority in which the outputs of the different models are merged.
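As a rough illustration of what merging by priority can look like, the sketch below keeps spans from higher-priority extractors and drops lower-priority spans that overlap them. The function name, the span format, and the overlap rule are assumptions made for this example; CyNER's own merging logic may differ.

```python
def merge_by_priority(outputs, priority):
    """Merge entity spans from several extractors.

    outputs:  dict mapping extractor name -> list of (start, end, label)
    priority: extractor names, highest priority first

    A span is dropped when it overlaps a span that was already kept
    from a higher-priority extractor (hypothetical rule for this sketch).
    """
    kept = []
    for name in priority:
        for start, end, label in outputs.get(name, []):
            overlaps = any(
                start < kept_end and kept_start < end
                for kept_start, kept_end, _ in kept
            )
            if not overlaps:
                kept.append((start, end, label))
    return sorted(kept)


# Made-up spans, purely for illustration
outputs = {
    "heuristics": [(10, 22, "IPv4")],
    "transformer": [(8, 22, "Indicator"), (30, 36, "Malware")],
    "flair": [(30, 36, "MISC"), (40, 45, "ORG")],
}
merged = merge_by_priority(outputs, ["heuristics", "transformer", "flair"])
print(merged)
```

Changing the order of the priority list changes which label wins when two extractors disagree about the same piece of text.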

👩‍💻 The code

The first step is to install the library with pip:

pip install git+https://github.com/aiforsec/CyNER.git

The authors of CyNER provided a Jupyter Notebook with a demo. However, if you run it as is, you won’t get the same results as seen in the notebook, which prompted me to write this post.

First, note that the second cell in the demo loads a locally fine-tuned model.

In [2]: model1 = cyner.CyNER(transformer_model='xlm-roberta-large', use_heuristic=False, flair_model=None)

Therefore, you need to train it first with the code below:

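import cyner

# Training configuration for fine-tuning
cfg = {'checkpoint_dir': 'MyFolder',                # where the fine-tuned model files are saved
        'dataset': 'dataset/mitre',                 # path to the BIO-formatted dataset
        'transformers_model': 'xlm-roberta-large',  # pretrained base model to fine-tune
        'lr': 5e-6,                                 # learning rate
        'epochs': 100,                              # number of training epochs
        'max_seq_length': 280}                      # maximum input sequence length
model = cyner.TransformersNER(cfg)
model.train()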
  • checkpoint_dir is the directory that will contain the model’s relevant files such as the weight file. For this parameter you can select any folder that you wish.
  • dataset is the path to the custom dataset to fine-tune your model. The authors of CyNER shared on GitHub a manually-labeled dataset annotated on different cybersecurity incidents from the MITRE database. To use this dataset, you only need to clone their repo and point to the correct folder. Otherwise, you can create your own custom dataset following the BIO format.
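
If you have not worked with the BIO format before: each token gets a tag, B-<label> for the first token of an entity, I-<label> for the following tokens of the same entity, and O for tokens outside any entity. Below is a small sketch that builds such tags. The to_bio helper and the labels used here are illustrative assumptions, and the exact file layout CyNER expects should be checked against the dataset folder in the repo.

```python
def to_bio(tokens, spans):
    """Build BIO tags for a tokenized sentence.

    spans: list of (start_token, end_token_exclusive, label)
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Hypothetical sentence and labels, for illustration only
tokens = ["FluBot", "targets", "Android", "phones", "."]
spans = [(0, 1, "Malware"), (2, 3, "System")]
for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(token, tag)
```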

Once your model is trained, you can load and use it by passing the checkpoint_dir you selected as the transformer_model argument.

text = 'Proofpoint report mentions that the German-language messages were turned off once the UK messages were established, indicating a conscious effort to spread FluBot 446833e3f8b04d4c3c2d2288e456328266524e396adbfeba3769d00727481e80 in Android phones.'
# Load the fine-tuned model from the checkpoint folder used during training
model_run = cyner.CyNER(transformer_model='MyFolder', use_heuristic=False, flair_model=None)
entities = model_run.get_entities(text)
# Print each extracted entity with its index
for i, e in enumerate(entities):
    print(i)
    print(e)
    print()

Side note

When running this model, I came across an error that said:

AttributeError: 'DataParallel' object has no attribute 'save_pretrained'

I haven’t fully investigated why this was happening, but most likely some of the dependencies have changed since CyNER was released. I solved this by going directly into the installed library and modifying it.

To find out where a Python library is installed, you can run:

>>> import cyner
>>> cyner.__file__
'/home/vmarquez/anaconda3/envs/myenv/lib/python3.9/site-packages/cyner/__init__.py'
>>>
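
The same lookup can be scripted without importing the package itself. This is a small sketch using the standard library's importlib; the package_dir helper is a name I made up for this example.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory a top-level package is installed in, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py
    return Path(spec.origin).parent


print(package_dir("cyner"))  # prints None if cyner is not installed
```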

Then I updated the following lines in this file (replace the path with your own): ~/anaconda3/envs/myenv/lib/python3.9/site-packages/cyner/tner/model.py

  • Line 319: self.model.module.save_pretrained(self.args.checkpoint_dir)
  • Line 339: self.model.module.from_pretrained(self.args.checkpoint_dir)
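
A likely explanation, which I have not verified against CyNER's source: PyTorch's torch.nn.DataParallel wrapper keeps the original model in its .module attribute, so methods such as save_pretrained have to be called on model.module rather than on the wrapper. The sketch below shows a defensive way to handle both the wrapped and the unwrapped case. To keep it runnable without PyTorch it uses stand-in classes; FakeModel, FakeDataParallel, and unwrap are all hypothetical names.

```python
class FakeModel:
    """Stand-in for a Hugging Face model (illustration only)."""

    def save_pretrained(self, path):
        return f"saved to {path}"


class FakeDataParallel:
    """Stand-in mimicking how torch.nn.DataParallel stores the wrapped model in .module."""

    def __init__(self, module):
        self.module = module


def unwrap(model):
    """Return the underlying model whether or not it is wrapped.

    With real PyTorch, isinstance(model, torch.nn.DataParallel) is a
    stricter check than looking for a .module attribute.
    """
    return model.module if hasattr(model, "module") else model


wrapped = FakeDataParallel(FakeModel())
print(unwrap(wrapped).save_pretrained("MyFolder"))
```

Hard-coding self.model.module, as in my fix above, works when the model is wrapped; a check like this would also cover the case where it is not.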

I hope you’ve enjoyed reading this post! I want to end by thanking tilusnet on GitHub for the insight on how to get started.
