Doccano is a versatile open-source text annotation tool for various NLP tasks, such as text classification, sequence labeling, and sequence-to-sequence tasks.
Abstract
Doccano is an open-source text annotation tool that supports multiple NLP tasks, including text classification, sequence labeling, and sequence-to-sequence tasks. It allows users to create labeled data for sentiment analysis, named entity recognition, text summarization, and more. With its user-friendly interface and collaborative features, Doccano simplifies the process of building training datasets for machine learning models. The tool can be installed locally using pip or Docker, and supports various input formats such as TextFile, TextLine, JSONL, and CONLL. Doccano also provides statistics and metrics about the annotated entities, which can help identify class imbalances.
Bullet points
Doccano is an open-source text annotation tool for multiple NLP tasks.
Supports text classification, sequence labeling, and sequence-to-sequence tasks.
Allows users to create labeled data for sentiment analysis, named entity recognition, text summarization, etc.
Can be installed locally using pip or Docker.
Supports various input formats, including TextFile, TextLine, JSONL, and CONLL.
Offers a user-friendly interface and collaborative features.
Provides statistics and metrics about annotated entities.
Doccano — A Tool To Annotate Text Data To Train Custom NLP Models
Garbage in, garbage out: good-quality data is the fuel to robust ML engines
I recently had to build an NLP model for a very specific task in a very technical domain.
This was challenging since I didn’t know anything about the subject matter and no training data was available.
As a data scientist, I first looked for open-source datasets, pretrained models, and scientific articles. Nothing really matched what I was aiming for, so I concluded that the only way to solve my problem was to build a training dataset from the ground up.
→ Manual labeling had to be done 🤷‍♂️
To this end, I reviewed multiple text labeling tools and the one I finally picked was Doccano.
Screenshot by the author
In this post, I’ll tell you more about this tool: I’ll first go over its multiple features and show you how to install it and set it up for your teams.
Then, I’ll present a concrete example in which Doccano helps with annotating data for a named entity recognition (NER) task.
This post is a personal review. I am not endorsed by Doccano or any of its core developers in writing it.
Sometimes you need to invest time in building a labeled dataset ⏳
With the rise of very powerful models powered by massive transformers, we tend to think that a lot of today’s NLP tasks are solved.
→ If you have access to labeled data, fine-tuning a pretrained transformer on your downstream task (classification, translation, NER) will probably get you good or even state-of-the-art (SOTA) results.
→ If you don’t have labeled data, you can load a transformer that’s been pretrained on a similar task and apply it as-is.
→ If you don’t have access to training data or a pretrained model, you can sometimes perform zero-shot learning: you load a transformer, indicate your class labels, and perform predictions against them.
Although they seem like pure magic, pretrained models (and transformers) are not the solution to every single NLP problem.
In fact, without domain knowledge built into the training data, pretrained models can only have a broad view of a topic, especially if they’re trained on general-purpose corpora (👋 Wikipedia).
That’s why annotating data and building training datasets can be a reasonable solution when you’re faced with a problem that requires a lot of domain expertise.
It sure seems painful at first but it’s valuable in the long run.
Introducing Doccano
After reviewing 3 or 4 text annotation tools (both open-source and proprietary), I finally decided to invest time in doccano.
As described in the official documentation, Doccano is
“an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequence to sequence tasks. (..), you can create labeled data for sentiment analysis, named entity recognition, text summarization and so on. Just create a project, upload data and start annotating. You can build a dataset in hours.”
GIF modified by the author
To be able to use doccano locally, you have two options:
1 — Install locally with pip
You can spin it up in a few seconds on your computer.
Simply run this command to install it.
pip install doccano
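doccano uses SQLite as its default database, but you can also install it with PostgreSQL support. In that case, install the postgresql extra (pip install 'doccano[postgresql]') and point the DATABASE_URL environment variable at your PostgreSQL instance.
Then initialize the database (doccano init), create a superuser (doccano createuser), start the web server (doccano webserver) and, in a second terminal, start the task queue (doccano task). Open the app in your browser, enter the username and the password you defined when creating the superuser, and you’re good to go.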
Screenshot by the author
2 — Docker or Docker Compose
You can also start doccano using Docker, which is a good alternative if you want to deploy it on the cloud.
Docker condenses the steps you previously ran manually into a few commands: pull the doccano image from Docker Hub, set up the credentials, the database, and the networking, then create and start a container.
Doccano makes labeling an easy process. Instead of enumerating its features, let’s create an annotation project and start labeling. We’ll see the interesting stuff along the way.
→ Separate/independent projects for each of your tasks
After starting Doccano, head over to the sign-in page and enter your credentials. Once logged in, you’ll be presented with a list of the different annotation projects you launched in the past.
Screenshot by the author
→ Multiple types of annotation
Hit the Create button to launch a new project and select the appropriate task (text classification, sequence labeling, sequence to sequence, etc.)
Let’s select the sequence labeling task for example. This is suitable for tasks such as named entity recognition.
→ Easy project setup
Once you select a task, Doccano will ask you for a project name, a description, and some tags, and whether to allow overlapping entities, relation labeling, or annotations shared between users.
Screenshot by the author
→ Configure the project
Before starting the annotation process, some configurations should be set first.
Define the labels: for example, the entities you’re interested in detecting. You can also attribute a color code to each one of them.
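Create the users: Doccano is a collaborative tool in which multiple users can work on the same project.
Import the dataset: Doccano supports several input formats, such as TextFile, TextLine, JSONL, and CoNLL.
Once the data is uploaded, you can view it in a table and start annotating it.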
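Since JSONL is one of the accepted input formats, here is a minimal Python sketch of how raw texts could be turned into a JSONL file ready for upload. The "text" key, the sample sentences, and the file name to_annotate.jsonl are assumptions made for illustration. Check the import dialog of your Doccano version for the field name it expects.

```python
import json
from pathlib import Path


def write_jsonl(texts, path):
    """Write one JSON object per line, each holding a single document."""
    with Path(path).open("w", encoding="utf-8") as f:
        for text in texts:
            # "text" is an assumed key, not confirmed by the article.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


# Hypothetical sample documents.
docs = [
    "The pump was replaced at the Lyon plant.",
    "Maintenance is scheduled for Monday.",
]
write_jsonl(docs, "to_annotate.jsonl")
```

One document per line keeps the file easy to split across annotators and easy to append to later.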
→ Start the annotation process
I like Doccano’s annotation interface: it’s user-friendly, simple, and gives you access to everything you need:
Interactive labeling from the interface
Completion rate
Comments and metadata to enrich the annotation
GIF by the author
→ Export annotated data
When you’re done, you can export the annotations into JSONL format.
Screenshot by the author
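To give an idea of what you can do with the exported file, here is a short Python sketch that reads JSONL records and pulls out the annotated entities. The record layout (an "id", a "text", and a "label" list of [start, end, label] character spans) is an assumption about the sequence labeling export. Key names have changed between Doccano versions, so inspect one line of your own export first. The sample data is made up.

```python
import json

# Two lines shaped like a sequence labeling export (assumed layout).
exported = """\
{"id": 1, "text": "Marie Curie worked in Paris.", "label": [[0, 11, "PER"], [22, 27, "LOC"]]}
{"id": 2, "text": "No entities here.", "label": []}
"""


def read_entities(jsonl_text):
    """Yield (doc_id, surface_text, label) for every annotated span."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for start, end, label in record.get("label", []):
            # Spans are treated as character offsets into the text.
            yield record["id"], record["text"][start:end], label


entities = list(read_entities(exported))
for doc_id, surface, label in entities:
    print(doc_id, surface, label)
```

From here, the spans can be converted into whatever format your training library expects.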
→ Metrics
Doccano provides some statistics about the entities you annotated. This can help you spot class imbalance.
Screenshot by the author
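If you want to run the same kind of check outside the interface, for example on an exported file, a rough Python sketch could look like this. The record layout ([start, end, label] spans under a "label" key) and the 30% threshold are assumptions for illustration, not something Doccano prescribes.

```python
from collections import Counter


def label_distribution(records):
    """Return {label: share of all annotated spans}."""
    counts = Counter(
        label for record in records for _, _, label in record["label"]
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Hypothetical annotations: three PER spans and one LOC span.
records = [
    {"label": [[0, 5, "PER"], [10, 15, "PER"], [20, 25, "LOC"]]},
    {"label": [[0, 5, "PER"]]},
]
shares = label_distribution(records)

# Flag labels below an arbitrary 30% share as under-represented.
rare = [label for label, share in shares.items() if share < 0.3]
print(shares, rare)
```

A label that shows up in `rare` is a hint to annotate more examples of that class before training.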
Wrap up
Doccano is a great tool to collaborate on annotation tasks. If you need to create a machine learning dataset and you have no experience in this area, you should definitely give it a shot.
As an ongoing open-source project, Doccano doesn’t include all the shiny features that we see on enterprise (and costly) solutions like Labelbox.
For example, it doesn’t provide specialized annotation tools for computer vision, SSO, or API integration.
Nevertheless, you can still get a lot done quickly and without investing too much.