Dmytro Iakubovskyi

Summary

A BERT transformer model has been trained on a dataset of common and randomly generated passwords to create a password strength checker with a weighted accuracy of 99.4% for passwords up to 10 symbols in length.

Abstract

The article describes the development of a password strength checker based on the BERT (Bidirectional Encoder Representations from Transformers) model. The model is trained on a combined dataset of 2 million passwords: 1 million of the most common passwords sourced from a public dataset and 1 million complex, randomly generated passwords. Training starts from Google's case-sensitive BERT model with about 108 million trainable parameters and runs in a Kaggle notebook on an NVIDIA Tesla P100 GPU. Accuracy on the test set rises from approximately 50% to 99.4%. The final model is publicly available through HuggingFace, with a note that a cased tokenizer must be used for the model to distinguish lowercase from uppercase letters.

Opinions

  • The author believes that enhancing password security is crucial in the current cyber threat landscape.
  • The use of BERT, a state-of-the-art model in Natural Language Understanding, is considered effective for password strength evaluation.
  • The combination of common and randomly generated passwords in the training dataset is seen as a robust approach to improve the model's performance.
  • The author emphasizes the accessibility and reproducibility of their work by providing the training code and making the model available on Kaggle and HuggingFace.
  • The author suggests that the model's performance is reasonable and useful, inviting feedback and further discussion in the comments section.

Boosting Password Security with Natural Language Understanding: Building a Simple Password Strength Checker with BERT Transformer

The final model is trained on 2 million passwords (the most common ones plus randomly generated ones), reaches a weighted accuracy of 99.4% on passwords of up to 10 symbols, and can be freely used via HuggingFace.

Photo by Kasia Derenda on Unsplash

In an era where cyber threats are more pervasive than ever, ensuring the security of online accounts is of paramount importance. Passwords are often the first line of defense against unauthorized access, making their strength a critical factor in safeguarding our digital lives.

In this article, I show how to enhance password security with BERT (Bidirectional Encoder Representations from Transformers), one of the most widely used publicly available models in Natural Language Understanding.

The first step is to take a public dataset of about 1 million of the most common passwords, available on Kaggle, and mix it with an equally sized sample of 1 million randomly generated complex passwords. The random passwords are between 6 and 10 symbols long and include lowercase and uppercase letters, digits, and common special characters.
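The dataset construction described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the special-character set, the function names, the 0/1 label convention, and the tiny stand-in list of common passwords are all assumptions made for the example.

```python
import random
import string

# Assumed character pool: the article only says "common special characters",
# so this particular set is a guess made for illustration.
SPECIAL_CHARS = "!@#$%^&*()-_=+"
ALPHABET = (
    string.ascii_lowercase + string.ascii_uppercase + string.digits + SPECIAL_CHARS
)


def random_password(rng, min_len=6, max_len=10):
    """Return one random password with a length between min_len and max_len."""
    length = rng.randint(min_len, max_len)  # both bounds are inclusive
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def build_dataset(common_passwords, rng):
    """Mix common passwords (label 0) with an equal number of random ones (label 1)."""
    rows = [(pw, 0) for pw in common_passwords]
    rows += [(random_password(rng), 1) for _ in common_passwords]
    rng.shuffle(rows)
    return rows


rng = random.Random(42)  # fixed seed so the sketch is reproducible
# Hypothetical stand-in for the ~1 million common passwords from the public dataset.
common = ["123456", "password", "qwerty", "letmein"]
dataset = build_dataset(common, rng)
```

Generating exactly as many random passwords as there are common ones keeps the two classes balanced. The `random` module is adequate for producing training data like this, but it is not meant for generating real passwords; Python's `secrets` module exists for that purpose.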

Then, I fine-tune one of the pre-trained models available on HuggingFace, Google’s case-sensitive BERT model with about 108 million trainable parameters, on this data.

The final code for data selection and training is available as a Kaggle notebook.

The training process takes about 45 minutes on the NVIDIA Tesla P100 GPU available to Kaggle users, and increases the overall accuracy (measured on the test set) from about 50% to 99.4%:

Source: author, Passwords strength checker BERT | Kaggle
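A starting accuracy of about 50% is what one would expect on a balanced two-class dataset before fine-tuning, since a classifier that has learned nothing is right about half of the time. The sketch below illustrates this with made-up labels. It assumes the reported figure is the share of correctly classified test passwords; the article does not spell out how the weighting is done, and the data here is hypothetical, not the author's test set.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical balanced test set: 0 = common password, 1 = random password.
y_true = [0, 1] * 500

# A classifier that always predicts the same class is right half of the time.
always_common = [0] * len(y_true)
baseline = accuracy(y_true, always_common)

# A hypothetical classifier that misclassifies 6 of the 1000 samples.
y_pred = list(y_true)
for i in range(6):
    y_pred[i] = 1 - y_pred[i]
after = accuracy(y_true, y_pred)
```

Here `baseline` comes out at 0.5 and `after` at 0.994, which mirrors the before and after figures quoted above.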

Spot-checking individual samples also shows reasonable model behavior:

Source: author, Passwords strength checker BERT | Kaggle

The final model can be freely used via HuggingFace. Note, however, that the hosted inference API uses the default (uncased) tokenizer, so it does not distinguish between lowercase and uppercase letters. To use the model with the correct tokenizer, invoke it as follows:

# Use a pipeline as a high-level helper - need to specify cased tokenizer
from transformers import pipeline

pipe = pipeline("text-classification", model="dima806/strong-password-checker-bert", tokenizer="bert-base-cased")
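Once created, the pipeline can be called on a list of strings and returns one dictionary with a `label` and a `score` per input. The sketch below wraps that call in a small helper. The helper names are made up for this example, and the label names themselves depend on the model's configuration, so none are shown here.

```python
def score_passwords(classifier, passwords):
    """Run the classifier on each password and pair it with its top prediction."""
    passwords = list(passwords)
    results = classifier(passwords)  # one {"label": ..., "score": ...} per input
    return [(pw, res["label"], res["score"]) for pw, res in zip(passwords, results)]


def load_classifier():
    """Build the pipeline with the cased tokenizer (downloads the model on first use)."""
    from transformers import pipeline

    return pipeline(
        "text-classification",
        model="dima806/strong-password-checker-bert",
        tokenizer="bert-base-cased",
    )
```

For example, `score_passwords(load_classifier(), ["password", "Xk#9vQ2!"])` returns one `(password, label, score)` tuple per input. Passing the classifier in as an argument also makes it easy to reuse a single loaded pipeline across many calls.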

I hope these results will be useful for you. If you have questions or feedback, do not hesitate to write in the comments below.
