I Built the Same Virus Knowledge Graph on Gemini Cloud, AuraDB and Neo4j Desktop
No-code vs. Low-code vs. Code; Cloud vs. On-prem
You will learn:
1. how to use the no-code Gemini Cloud to construct a virus knowledge graph
2. how to construct the same knowledge graph on Neo4j Desktop and AuraDB
3. some fun facts about influenza, COVID-19, monkeypox, and the common cold
4. the pros and cons of each platform

Knowledge graphs have been gaining traction quickly. Big tech companies use them widely, for example in Google’s Infobox, Amazon Music, and the Microsoft Academic Knowledge Graph. That should not be a surprise, because graphs are excellent at managing knowledge. They store information the way a human does, that is, as subject-verb-object triples. We can transform much of our existing data into knowledge graphs and then learn a lot by exploring, searching, and analyzing them.
The construction is straightforward: we break down our domain knowledge into a series of subject-verb-object triples and write them into a collection of CSV files. We then import them into a graph database, such as Neo4j. Voilà! And you can see how I built my knowledge graph for the Carbohydrate-Active enZYmes (CAZy) in this article.
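To make the triple-to-CSV step concrete, here is a minimal Python sketch. The function name `triples_to_csv`, the column names `id`, `from`, `type`, and `to`, and the placeholder triples are all assumptions for illustration; they are not taken from the project's repository.

```python
import csv
import io


def triples_to_csv(triples):
    """Split (subject, verb, object) triples into a node CSV and a relation CSV.

    Returns two strings: the node table (one ID per row) and the
    relation table (from, type, to).
    """
    # Every subject and object becomes a node; a set removes duplicates.
    node_ids = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})

    node_buf = io.StringIO()
    node_writer = csv.writer(node_buf, lineterminator="\n")
    node_writer.writerow(["id"])
    for node_id in node_ids:
        node_writer.writerow([node_id])

    rel_buf = io.StringIO()
    rel_writer = csv.writer(rel_buf, lineterminator="\n")
    rel_writer.writerow(["from", "type", "to"])
    rel_writer.writerows(triples)

    return node_buf.getvalue(), rel_buf.getvalue()


# Placeholder triples, purely illustrative.
example_triples = [
    ("virus_A", "INFECTS", "host_X"),
    ("drug_B", "TARGETS", "virus_A"),
]
nodes_csv, relations_csv = triples_to_csv(example_triples)
print(nodes_csv)
print(relations_csv)
```

Keeping nodes and relations in separate tables is what lets the same files feed different importers later on.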
During data preparation, we need to pay extra attention to data quality so that the graph represents our knowledge fully. This step usually takes the most time. But the steps that follow are important, too. Because a given project can have many CSV files that represent many types of nodes and relations, a good data importer can ease the process considerably. After the import, a good GUI can help us explore the content and discover new insights. Finally, it would be nice if the graph platform had statistics or even machine learning integrated into its interface.
Neo4j has always been my graph database of choice. Its query language Cypher is easy to learn and write. The database is powerful and fast. Neo4j has powered my medical chatbot Doctor.ai (1, 2, 3, 4, 5, 6, 7, and 8). On-prem, we can use Neo4j Desktop to prototype. In the cloud, we can either deploy Neo4j on an EC2 instance or simply use Neo4j’s own fully managed AuraDB. No matter which one you choose, some basic Cypher is required. That could be a speed bump for knowledge graph newcomers. So for people who want to dive headfirst into knowledge graphs, is there a no-code alternative?
Fortunately, the answer is yes. A San Francisco startup called Gemini Data debuted Gemini Explore and its online platform Gemini Cloud at GraphConnect 2022. It is an end-to-end, no-code, cloud-native graph database application. Users can import, explore, and search graph data just by clicking and typing. They can also bring their own Neo4j or AuraDB databases to Gemini. Its interface is similar to Neo4j Bloom, but it comes with much-needed aggregate functions, so Neo4j users should feel quite at home with Gemini Cloud. In a future release, Gemini will be able to do semantic searches, like Doctor.ai or Ask Data from Tableau, and it will also be able to connect to our local graph databases. Under the hood, Gemini stores the user's graph data in Neo4j (default), TigerGraph, Dgraph, ArangoDB, JanusGraph, or AWS Neptune. Currently, Gemini Cloud is in closed beta. It will be available in August 2022, along with a guided tour.
Gemini Data has kindly provided me with a trial version. It came at the right time because I needed to build a new knowledge graph about viruses myself. So I built it with Gemini Cloud, Neo4j Desktop, and AuraDB. In this article, I want to document the processes and explain their pros and cons. And we are going to learn something interesting about influenza, COVID-19, monkeypox, and the common cold along the way. The data and scripts for this project are hosted on my GitHub repository here.
1. Data preparation
The data comes from two sources: the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Virus-Host Database (VHDB). KEGG keeps a comprehensive list of viruses, their proteins, and the drugs that target them. VHDB complements KEGG with detailed information about the viral hosts and diseases. Because VHDB has normalized its data entities with KEGG identifiers, it was quite easy to link the two data sources together. I downloaded the VHDB database in TSV format. Then I used KEGG’s REST API to download the remaining information for my knowledge graph. Finally, I formatted everything into CSV.
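As a rough illustration of the KEGG download step, the sketch below builds a URL for KEGG's REST `list` operation and parses its tab-delimited response into a dictionary. Which operations and databases the project actually queried is not stated here, so the use of `list`, the `disease` database name in the usage comment, and the helper names are assumptions.

```python
from urllib.request import urlopen

KEGG_BASE = "https://rest.kegg.jp"


def kegg_list_url(database):
    """Build the URL for KEGG's 'list' operation on one database."""
    return f"{KEGG_BASE}/list/{database}"


def parse_kegg_list(text):
    """Parse tab-delimited 'list' output into {entry_id: description}."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry_id, _, description = line.partition("\t")
        entries[entry_id] = description
    return entries


def fetch_kegg_list(database):
    """Download and parse one KEGG listing (requires network access)."""
    with urlopen(kegg_list_url(database)) as response:
        return parse_kegg_list(response.read().decode("utf-8"))


# Hypothetical usage, not executed here because it needs the network:
# diseases = fetch_kegg_list("disease")
```

Separating the parser from the network call makes it easy to test the parsing on a saved response before hitting the API repeatedly.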
Because the same data will be imported into three platforms, I separated the data into two types of files: node and relation. The node files contain an ID column and other property fields. They define the nodes. The relation files connect the nodes using their IDs. They look like so.
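Since relation files refer to nodes only by ID, a quick consistency check before importing can catch IDs that have no matching node. The sketch below assumes a node file whose ID column is named `taxid` and a relation file with `from` and `to` columns; the sample rows are placeholders, not real project data.

```python
import csv
import io


def read_rows(text):
    """Read CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def dangling_ids(node_rows, relation_rows, id_column="taxid"):
    """Return relation endpoint IDs that do not appear in the node file."""
    known = {row[id_column] for row in node_rows}
    missing = set()
    for row in relation_rows:
        for endpoint in ("from", "to"):
            if row[endpoint] not in known:
                missing.add(row[endpoint])
    return sorted(missing)


# Placeholder content standing in for a node file and a relation file.
node_text = "taxid,name,rank\n1,group_one,family\n2,group_two,genus\n"
relation_text = "from,to\n1,2\n1,3\n"

print(dangling_ids(read_rows(node_text), read_rows(relation_text)))
```

An empty result means every relation endpoint is backed by a node row.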



2. Import in Gemini Cloud
The free Gemini only supports CSV import, while the paid version will support many other data sources and types. Once the data were ready, I logged into my Gemini console and created a new project called virus_kg. I clicked the Data Modeling button and afterward the + Create Flow button to begin the first data import.
Gemini Cloud breaks down the data modeling process into subprocesses called flows. In each flow, we work with just one CSV file. It turns out that the import order also matters because later imports overwrite earlier ones. After examining my data, the Gemini team advised me to import the relation files before the node files. So I began with the host_taxon_connections.csv file (Figure 3).

This file establishes the taxonomic hierarchy in my knowledge graph. Taxonomy organizes organisms with shared characteristics into a taxonomic group and assigns the group to a taxonomic rank. This group can be merged with other related groups to form a more inclusive one of higher rank. The host_taxon_connections.csv file contains two columns: from and to. In each line, the from column holds the taxon to which the taxon in the to column belongs. I filled both columns with the NCBI taxids. In Step 3, I created a Taxon node for each of the two columns. I also set taxid as the name of their node property. For Neo4j users, Name Tag is Gemini’s way of saying node label. According to Gemini, I needed to remove the check mark next to Unique Key but activate the one next to taxid (Figure 4).

Next, I defined the HAS_TAXON relation between the from and the to nodes (Figure 5).
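To see what the from/to rows encode, here is a small Python sketch that rebuilds the hierarchy from such pairs and walks from a taxon up to its most inclusive group. The function names and the placeholder IDs are assumptions for illustration; real rows would hold NCBI taxids.

```python
def build_parent_map(rows):
    """Map each child taxid to its parent.

    Each row is (from_taxid, to_taxid), where the `to` taxon
    belongs to the `from` taxon.
    """
    return {child: parent for parent, child in rows}


def lineage(taxid, parent_of):
    """Walk upward from a taxon and return the path to the top-level group."""
    path = [taxid]
    seen = {taxid}
    while path[-1] in parent_of:
        parent = parent_of[path[-1]]
        if parent in seen:
            raise ValueError(f"cycle detected at taxid {parent}")
        path.append(parent)
        seen.add(parent)
    return path


# Placeholder rows: "10" belongs to "1", and "100" belongs to "10".
rows = [("1", "10"), ("10", "100")]
print(lineage("100", build_parent_map(rows)))
```

The cycle check is a cheap safeguard against malformed rows that would otherwise loop forever.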

Then it came to the Preview page. It visualized how the nodes and the relation appear in the graph (Figure 6).

On the Review page, I simply clicked Close instead of Start so that I could run all the flows later from the overview page. That gave me the chance to modify the flows in case of misconfiguration.

After configuring the host_taxon_connections flow, I worked with the host taxon nodes. I created a new flow with the host_taxon.csv as its source (Figure 8).

On the Create Node(s) page, the Gemini team advised me to enter, under the Select A Source Field field, the property that I would like to display later in the knowledge graph. For the Taxon node, that property would be name. Again, I set Taxon as the node label. I also added all the properties from the CSV file, such as taxid, name, and rank, to the node. Here, I again unchecked Unique Key next to Select A Source Field but checked the one next to taxid (Figure 9).

Since it was a node importing flow, there was no need to define any relation in it. After the configurations of the host_taxon_connections.csv and host_taxon.csv, I simply repeated the process and imported all the files (Figure 10).

The overview page also allows me to review and run the flows. If any of the flows fails, I can click into that flow, go through the wizard pages, and fix the error.
3. Data exploration in Gemini Cloud
3.1 Influenza
After the import, I went to Gemini’s Explore Nodes/Exploration mode to browse the knowledge graph. In the following figure, you can see that as I typed influ, the search bar immediately suggested the Disease node in its dropdown (Figure 11).

Then Gemini Cloud executed the search and the Influenza node appeared on the canvas. When I left-clicked the node, I could read the node properties, such as its description and category. Next, I right-clicked the node to open the context menu in order to explore its neighbors (Figure 12).

Gemini Cloud showed two types of nodes connected to this Influenza node: Drug (6 nodes) and Taxon (5 nodes) (Figure 12). Here, I chose the drug nodes (Figure 13).

Gemini Cloud allows me to select a group of nodes and calculate some statistics easily. Here, I calculated the average molecular weight of the six drugs just by selecting them and choosing avg in the drop-down on the Profiler panel.

Gemini allows us to do both conditional display and conditional search based on node properties. For example, I scaled the node size according to the molecular weight in Figure 14. The higher the molecular weight, the larger the node. This is called conditional display. And it can be set up in Node Appearance (the color palette icon).
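The two operations above, averaging a property over selected nodes and sizing nodes by that property, can be sketched in a few lines of Python. How Gemini maps values to sizes internally is not documented here, so the linear min-max rule, the size range, and the placeholder weights below are assumptions.

```python
from statistics import mean


def scale_sizes(weights, min_size=10.0, max_size=40.0):
    """Linearly map each weight to a node size between min_size and max_size."""
    lowest, highest = min(weights.values()), max(weights.values())
    if highest == lowest:
        # All weights equal: give every node the middle size.
        return {name: (min_size + max_size) / 2 for name in weights}
    span = highest - lowest
    return {
        name: min_size + (weight - lowest) / span * (max_size - min_size)
        for name, weight in weights.items()
    }


# Placeholder molecular weights, not the values from the graph.
weights = {"drug_a": 100.0, "drug_b": 200.0, "drug_c": 300.0}

print(mean(weights.values()))  # average over the selected nodes
print(scale_sizes(weights))    # heavier molecules get larger nodes
```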
3.2 COVID-19
Next, I clicked the TARGETS relation and opened 200 nodes/relations in the graph. The TARGETS relations connect drugs and their viral target proteins (KEGG). It was not hard to find the COVID-19 cluster.

Then I expanded K24152’s neighbors a bit, selected all the nodes, and clicked Isolate in the context menu. As a result, Gemini removed all the other nodes except this cluster.

This brief exploration immediately made it clear that the vaccines and Remdesivir act on different parts of the coronaviruses. We can see that SARS, SARS-CoV-2 (the pathogen behind COVID-19), and the Bat coronavirus all have the spike glycoprotein (S) and the replicase polyprotein. The spike protein is targeted by a number of vaccines, while the replicase is targeted by Remdesivir.
I played a bit more with my knowledge graph in Gemini Cloud. And I am still impressed by its feature-rich interface and context menus. The graph rendering was fast and responsive. And it provides us with lots of control over the visualizations, too.
4. Import in Neo4j Desktop
Next, I imported the same data into my Neo4j Desktop on my local computer. The CSV files were first transferred to the project’s import folder and then imported into the database with the following commands. Please read “1. Import the data into Neo4j” from this article if you need more details.
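The exact commands are in the linked repository. As an illustration only, the Python sketch below generates Cypher `LOAD CSV` statements for the two taxon files described earlier. The helper names are hypothetical, and the statements assume the columns mentioned above (taxid, name, and rank in host_taxon.csv; from and to in host_taxon_connections.csv), so they may differ from what the project actually ran.

```python
def node_import_statement(filename, label, key, properties=()):
    """Build a LOAD CSV statement that merges one node per CSV row."""
    lines = [
        f"LOAD CSV WITH HEADERS FROM 'file:///{filename}' AS row",
        # Backticks keep column names safe even if they clash with keywords.
        f"MERGE (n:{label} {{{key}: row.`{key}`}})",
    ]
    if properties:
        assignments = ", ".join(f"n.{p} = row.`{p}`" for p in properties)
        lines.append(f"SET {assignments}")
    return "\n".join(lines)


def relation_import_statement(filename, label, key, rel_type):
    """Build a LOAD CSV statement that links existing nodes via from/to IDs."""
    return "\n".join([
        f"LOAD CSV WITH HEADERS FROM 'file:///{filename}' AS row",
        f"MATCH (a:{label} {{{key}: row.`from`}})",
        f"MATCH (b:{label} {{{key}: row.`to`}})",
        f"MERGE (a)-[:{rel_type}]->(b)",
    ])


print(node_import_statement("host_taxon.csv", "Taxon", "taxid", ["name", "rank"]))
print()
print(relation_import_statement("host_taxon_connections.csv", "Taxon", "taxid", "HAS_TAXON"))
```

The printed statements can be pasted into Neo4j Browser once the CSV files sit in the import folder. Unlike the Gemini flows, this sketch loads nodes before relations, because the `MATCH` clauses only link nodes that already exist.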









