Serve NCBI Taxonomy in AWS, Serverlessly
The world is moving fast towards cloud computing, because the cloud is easy to use, affordable, accessible and secure. Cloud providers such as Amazon Web Services (AWS) take over many repetitive, tedious IT maintenance tasks for their customers. As a result, cloud users can focus on their own business logic. The cloud also provides different pricing options that may be more cost-effective than on-premises servers. In addition, unlike its on-premises counterparts, the cloud is readily accessible over the internet while remaining secure.
In this tutorial, I am going to show you how I set up a serverless NCBI taxonomy API service in AWS. In microbial bioinformatics, one of the most frequent tasks is traversing the NCBI taxonomy, because biologists need to define taxonomic names unambiguously in computer programs. This includes the following operations: retrieving the scientific name, the taxonomic rank, the parent taxon and the daughter taxa for a given taxid and, conversely, retrieving the taxid for a given taxonomic name.

The problem: NCBI provides all this information in a group of text files that are not structured in an easily accessible way. The task is clear: make this information programmatically accessible. And access should be fast, since these operations are performed very frequently.
Previously, I wrote a blog post and a Python library, Pyphy, just for this purpose. Apart from my approach, there are other implementations in the Python world (etetoolkit, ncbi-taxonomist and taxadb). But some drawbacks come to mind:
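To make these operations concrete, here is a minimal in-memory sketch in Python. It is not how Pyphy or the API service is implemented; the dictionary holds only a toy excerpt of four taxa, and the function names are hypothetical.

```python
# Toy excerpt of the taxonomy: taxid -> (scientific name, rank, parent taxid).
# Only four taxa are included to keep the sketch short.
TAXA = {
    543: ("Enterobacteriaceae", "family", 91347),
    561: ("Escherichia", "genus", 543),
    562: ("Escherichia coli", "species", 561),
    564: ("Escherichia fergusonii", "species", 561),
}

# Reverse index for name -> taxid lookups.
NAME_TO_TAXID = {name: taxid for taxid, (name, _rank, _parent) in TAXA.items()}


def get_name(taxid):
    """Return the scientific name of a taxid."""
    return TAXA[taxid][0]


def get_rank(taxid):
    """Return the taxonomic rank of a taxid."""
    return TAXA[taxid][1]


def get_parent(taxid):
    """Return the parent taxid of a taxid."""
    return TAXA[taxid][2]


def get_daughters(taxid):
    """Return the sorted taxids whose parent is the given taxid."""
    return sorted(t for t, (_name, _rank, parent) in TAXA.items() if parent == taxid)


def get_taxid(name):
    """Return the taxid of a scientific name."""
    return NAME_TO_TAXID[name]
```

A real service has to answer the same five questions for millions of taxa, which is why the data ends up in an indexed database rather than a dictionary.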
- It is language specific. Users need to write their programs in the library's language, such as Python, to make use of it.
- It is local. Whoever wants to use it needs to set up the whole thing in their local environment. Each local copy then needs to be updated individually. And you cannot access my installation via the internet.
- It scales poorly. Its performance depends on the hardware where it is installed. And there is no way to automatically scale it up when needed.
For this reason, I moved Pyphy into the AWS cloud and made it into a REST API service. This solves all three drawbacks at once:
- It is language agnostic. Virtually all programming languages can call a REST API; in fact, you can even use an app like Postman or just a browser to interact with it.
- It is in the cloud. As long as it is up and running, everybody with internet access can use it right away without installing anything.
- It scales. It can be set up so that Amazon increases the computing resources to meet surges in demand.
I thought about various ways to move Pyphy onto AWS, and finally settled on the easiest and cheapest one:
- Backend: Aurora Serverless. I uploaded the data via a Cloud9 session.
- Frontend: API Gateway with a Lambda function
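To illustrate how the frontend pieces fit together, here is a minimal sketch of a Lambda handler behind API Gateway, assuming the Lambda proxy integration. The `lookup_taxon` function and its toy data are hypothetical stand-ins for the actual query against Aurora, and the `taxid` query parameter is an assumed name, not necessarily the one used in my repository.

```python
import json


def lookup_taxon(taxid):
    """Hypothetical stand-in for the database query against Aurora."""
    toy = {
        562: {"taxid": 562, "name": "Escherichia coli", "rank": "species", "parent": 561},
    }
    return toy.get(taxid)


def lambda_handler(event, context):
    """Entry point invoked by API Gateway (Lambda proxy integration)."""
    # Query string parameters arrive as a dict of strings, or None if absent.
    params = event.get("queryStringParameters") or {}
    try:
        taxid = int(params["taxid"])
    except (KeyError, TypeError, ValueError):
        return {"statusCode": 400, "body": json.dumps({"error": "taxid must be an integer"})}

    record = lookup_taxon(taxid)
    if record is None:
        return {"statusCode": 404, "body": json.dumps({"error": f"taxid {taxid} not found"})}

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(record),
    }
```

The handler only translates between HTTP and a lookup function, so the database code can be swapped in later without touching the request handling.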

However, Aurora Serverless drops the first few calls after a certain period of inactivity, because it needs to wake up first (the Stack Overflow discussion and solution is here). Therefore, if you need a persistent service, please consider using provisioned Aurora.
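One common way to tolerate such wake-up failures on the client side is to retry with an increasing delay. The following is a generic sketch of that idea, not necessarily the solution from the linked discussion; `call_with_retry` and its parameters are hypothetical names.

```python
import time


def call_with_retry(func, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt, then try again.

    The last failure is re-raised so that the caller still sees real errors.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # In real code, catch only the connection error raised by your driver.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Wrapping the database call of the API in such a helper trades a slower first response for fewer failed requests while the cluster resumes.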
The finished product looks like:

The code can be found in my GitHub repository here.
1. Data Preparation
First, download the NCBI taxonomy data from its FTP site. For my purposes, I only need “names.dmp” and “nodes.dmp”. The file “nodes.dmp” contains the taxid, parent taxid and rank, while “names.dmp” contains the mapping between taxids and their various taxonomic names and synonyms.
To keep my database design simple, I consolidated them into two tsv files: “tree.tsv” and “synonym.tsv”. “tree.tsv” keeps the main information in one place, while “synonym.tsv” provides auxiliary information about the synonyms for some taxids. I accomplished this task with a Python script, “prepyphy.py”. I finished this step on my local machine. Of course, you can also do it the “cloud native” way: upload the two “dmp” files and my “prepyphy.py” onto AWS Cloud9 and perform the same task there.
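For readers who have not seen the dump format: fields in both files are separated by a tab, a pipe and a tab, and each line ends with a tab and a pipe. The sketch below reads the columns mentioned above; it is an illustration with hypothetical function names, not the code of “prepyphy.py”.

```python
def split_dmp_line(line):
    """Split one .dmp line into its fields."""
    line = line.rstrip("\n")
    # Drop the trailing tab-pipe terminator before splitting.
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def parse_nodes(lines):
    """Return {taxid: (parent_taxid, rank)} from the lines of nodes.dmp."""
    nodes = {}
    for line in lines:
        fields = split_dmp_line(line)
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def parse_names(lines):
    """Return {taxid: [(name, name_class), ...]} from the lines of names.dmp."""
    names = {}
    for line in lines:
        # Columns: taxid, name, unique name, name class.
        taxid, name, _unique_name, name_class = split_dmp_line(line)[:4]
        names.setdefault(int(taxid), []).append((name, name_class))
    return names
```

Both parsers accept any iterable of lines, so they can be called as `parse_nodes(open("nodes.dmp"))` and `parse_names(open("names.dmp"))`.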
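The consolidation step could look roughly like the sketch below. The column layout is an assumption made for illustration (taxid, scientific name, parent taxid and rank in the tree file; taxid and alternative name in the synonym file) and may differ from what “prepyphy.py” actually writes. The input dictionaries and the example rows are likewise illustrative.

```python
import io


def write_tables(nodes, names, tree_out, synonym_out):
    """Write tab-separated tree and synonym rows to two open text files.

    nodes: {taxid: (parent_taxid, rank)}
    names: {taxid: [(name, name_class), ...]}
    """
    for taxid in sorted(nodes):
        parent, rank = nodes[taxid]
        for name, name_class in names.get(taxid, []):
            if name_class == "scientific name":
                # One row per taxon in the tree table (assumed layout).
                tree_out.write(f"{taxid}\t{name}\t{parent}\t{rank}\n")
            else:
                # Every other name class goes to the synonym table.
                synonym_out.write(f"{taxid}\t{name}\n")


# Usage with in-memory files; real code would open tree.tsv and synonym.tsv for writing.
nodes = {562: (561, "species")}
names = {562: [("Escherichia coli", "scientific name"), ("Bacillus coli", "synonym")]}
tree_out, synonym_out = io.StringIO(), io.StringIO()
write_tables(nodes, names, tree_out, synonym_out)
```

The rows are written with plain tabs rather than a CSV writer, so that names containing quotation marks are not wrapped in extra quotes before the database import.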
2. Set up serverless Aurora
Now it is time to head to AWS. After logging into your AWS account, open Amazon RDS and create a serverless Aurora database: select “Serverless” under “Capacity type”:

I set the DB cluster identifier to “ncbi” and created a master username and master password. These credentials are needed later for the data import.
Under “Connectivity”, create a new VPC “pyphy” for this tutorial. Under “Additional configuration”, check “Data API” for debugging. Leave all other options and parameters at their defaults. Click “Create database” to let AWS prepare the database.
3. Import data into Aurora
Now switch to Cloud9. Select the same region as our Aurora database. Start by clicking “Create environment” and name it “pyphy-import”. In Step 2 “Configure settings”, unfold “Network settings (advanced)” and choose the same VPC as our Aurora database. Then go on to finish creating the Cloud9 environment.

3.1. Allow connection between Cloud9 and Aurora
Before moving on, we need to configure the security group so that our Cloud9 environment can communicate with our database cluster. First, click into the detail page of our newly created “pyphy-import” environment and copy the Security groups identifier.

Then head over to the RDS page and click into the detail page of our database “ncbi”. Under its “Connectivity & security” tab, take note of the “Endpoint”. Then click the “VPC security groups” link.

In the security group page, click “Inbound” and then “Edit”, and add a rule of type “MYSQL/Aurora”. Under “Source”, select “Custom” and paste the Cloud9 environment's security group identifier into the text field.
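The same inbound rule can also be created programmatically. The sketch below builds the arguments for the EC2 `authorize_security_group_ingress` call in boto3; I did this step in the console, so treat it as an untested alternative. The helper names are hypothetical, and the group identifiers are whatever you copied from the Cloud9 and RDS pages.

```python
def mysql_ingress_params(db_group_id, source_group_id):
    """Build the arguments that open TCP port 3306 to another security group."""
    return {
        "GroupId": db_group_id,  # security group of the Aurora cluster
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 3306,
                "ToPort": 3306,
                # Allow traffic coming from the Cloud9 environment's security group.
                "UserIdGroupPairs": [{"GroupId": source_group_id}],
            }
        ],
    }


def allow_cloud9_to_aurora(ec2_client, db_group_id, cloud9_group_id):
    """Apply the rule; ec2_client is expected to be boto3.client("ec2")."""
    return ec2_client.authorize_security_group_ingress(
        **mysql_ingress_params(db_group_id, cloud9_group_id)
    )
```

Referencing the Cloud9 security group as the source, rather than an IP range, keeps the rule valid even if the environment's instance gets a new address.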

3.2. Import tsv files into Aurora
Now we are ready for the data import. In our newly created Cloud9 environment, click “File -> Upload Local Files…” to upload both “synonym.tsv” and “tree.tsv”.
In the console panel in the lower half of the screen, we can use the normal mysql command to log into our Aurora database:
mysql -h [database endpoint] -P 3306 -u [database master username] -p
[database endpoint] is the URL that we wrote down in 3.1. The master username is the one that we set when we created the Aurora database in Step 2. The command will then ask for the master password from Step 2. Once logged in, we can issue a series of commands to set up a database “pyphydb”. Inside it, we can import and index two tables: “tree” and “synonym”.
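As a rough idea of what such a series of commands can look like, the Python sketch below assembles the SQL statements as strings. The table layouts, column sizes and index names are assumptions for illustration and may differ from the ones in my repository; only the database name, the table names and the tsv file names come from this tutorial.

```python
def build_import_sql(database="pyphydb"):
    """Return the SQL statements that create, load and index the two tables."""
    return [
        f"CREATE DATABASE IF NOT EXISTS {database};",
        f"USE {database};",
        # `rank` is quoted because it is a reserved word in newer MySQL versions.
        "CREATE TABLE tree (taxid INT PRIMARY KEY, name VARCHAR(255), "
        "parent INT, `rank` VARCHAR(50));",
        "CREATE TABLE synonym (id INT AUTO_INCREMENT PRIMARY KEY, "
        "taxid INT, name VARCHAR(255));",
        # Raw strings keep \t and \n as literal escape sequences for MySQL.
        r"LOAD DATA LOCAL INFILE 'tree.tsv' INTO TABLE tree "
        r"FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';",
        r"LOAD DATA LOCAL INFILE 'synonym.tsv' INTO TABLE synonym "
        r"FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' (taxid, name);",
        "CREATE INDEX idx_tree_name ON tree (name);",
        "CREATE INDEX idx_tree_parent ON tree (parent);",
        "CREATE INDEX idx_synonym_taxid ON synonym (taxid);",
        "CREATE INDEX idx_synonym_name ON synonym (name);",
    ]


print("\n".join(build_import_sql()))
```

The printed statements can be pasted into the mysql prompt one by one. Loading local files may require starting the client with the `--local-infile=1` option.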













