Harnessing the Power of Knowledge Graphs: Enriching an LLM with Structured Data

A step-by-step guide to creating a knowledge graph and exploring its potential to enhance an LLM

For accompanying code, see notebook here.

In recent years, large language models (LLMs), have become ubiquitous. Perhaps the most famous LLM is ChatGPT, which was released by OpenAI in November 2022. ChatGPT is able to generate ideas, give personalized recommendations, understand complicated topics, act as a writing assistant, or help you build a model to predict the Academy Awards. Meta has announced their own LLM called LLaMA, Google has LaMDA, and there is even an open-source alternative, BLOOM.

LLMs have excelled in natural language processing (NLP) tasks like the ones listed above because LLMs have historically focused on unstructured data — data that does not have a pre-defined structure, and is usually text-heavy. I asked ChatGPT, “why have LLMs historically focused on unstructured data?” The reply was:

“LLMs have historically focused on unstructured data due to its abundance, availability, and the challenges it poses. Unstructured data provides a vast source for training language models, allowing them to learn patterns, context, and semantics. LLMs excel in handling the complexities and nuances of unstructured data, capturing the subtleties of language and leveraging the rich context, opinions, and creative expressions it contains. While efforts are underway to bridge the gap between LLMs and structured data, the historical emphasis on unstructured data stems from its unique characteristics and the valuable information it offers, allowing LLMs to provide more human-like responses.”

Structured data is usually quantitative and well-organized, usually into rows and columns, and, as ChatGPT notes, there is still a gap between LLMs and structured data.

Knowledge graphs (KG), on the other hand, are excellent at querying structured data. A knowledge graph is,

“directed labeled graph in which domain specific meanings are associated with nodes and edges. A node could represent any real-world entity, for example, people, company, computer, etc. An edge label captures the relationship of interest between the two nodes, for example, a friendship relationship between two people, a customer relationship between a company and person, or a network connection between two computers, etc.” (Chaudhri et al., 2022).

KGs allow one to integrate heterogenous data sources, including combinations of structured, semi-structured, and unstructured data. KGs are used for organizing data, drawing inferences, creating recommendations, and semantic search.

As Shirui Pan et al. point out in their paper, “Unifying Large Language Models and Knowledge Graphs: A Roadmap,” the two models can be complementary. Some of the main weaknesses of LLMs, that they are black-box models and struggle with factual knowledge, are some of the KGs’ greatest strengths. KGs are, essentially, collections of facts, and they are fully interpretable.

How LLMs and KGs can be complementary. From Shirui Pan et al., 2023. <https://arxiv.org/abs/2306.08302>

Shirui et al. propose many potential ways LLMs and KGs can complement each other. In this tutorial, I will show how to create a KG from structured data, and then use that KG as part of the input prompt to the LLM, something called in-context learning. I will compare the LLM’s responses when using the KG as part of the input with the LLM’s responses when using the original structured data as part of the input prompt.

The methodology for this tutorial is as follows:

Download some metadata on World Bank documents using the World Bank API
Build an ontology using the metadata for the documents
Populate the ontology with instances of documents
Pull in additional entities and relationships into the graph using Wikidata
Query the KG directly using SPARQL
Compare the different ways of interacting with data: using SPARQL queries on the RDF, putting raw metadata into LlamaIndex, and putting the RDF data into LlamaIndex

If you don’t want to read any more, my overall takeaways are that:

While it is possible to use structured data (in the form of CSV or RDF files) directly, using in-context learning, to enhance an LLM, the results are not great. The LLM suffers the same problems: it sometimes gives correct answers but sometimes suffers from hallucinations (incorrect facts) and it is impossible to know how/why these are occurring.
Turning your structured data into a knowledge graph by building out an ontology and assigning each instance of data appropriate classes and properties can improve the results, but hallucinations and inexplicable inaccuracies persist.
If you are interested in specific queries across your personal structured dataset and need accurate and verifiable answers, you should use formal queries either using SPARQL or something else. A SPARQL query can answer a question like, “what are all of the projects associated with documents that this author has written?” much better than an LLM, even when enhanced through in-context learning with the KG.
An LLM can help write and refine the SPARQL query, however. If an LLM can translate a prompt into a SPARQL query, then a user can still ‘chat’ directly with structured data without writing their own code/queries.
LLMs are great at understanding and interpreting unstructured data. This capability extends even to structured data when unstructured information is embedded within it. For instance, if the structured data includes a column labeled ‘abstracts’ containing unstructured text, the LLM can leverage that data to generate insightful results.
Using both the raw metadata and the KG to enhance the LLM did not improve results.
Some other ways Shirui et al. suggest KGs and LLMs can work together: use LLMs to translate prompt into formal query, use KGs to validate the LLMs responses, use LLMs to build KGs, and use KGs to train LLMs.

1. Download some metadata on World Bank documents using the World Bank API

First we need the metadata for some World Bank docs. For full documentation on the World Bank Documents and Reports API, go here.

I selected World Bank document metadata as the foundation for our knowledge graph for several reasons. Firstly, the World Bank provides an API that enables access to their data. Additionally, the metadata associated with World Bank documents is comprehensive and offers valuable information. Lastly, my familiarity with this type of data ensures a better understanding of its structure and attributes. It’s important to note that the flexibility of building a knowledge graph applies to any data source, as long as one possesses sufficient domain knowledge to structure it effectively.

The following code gives us the metadata for the most recent 20 reports that have the phrase “sustainable development” in their title.

import requests
import json
import pandas as pd

url = 'https://search.worldbank.org/api/v2/wds'
params = {
    'format': 'json',
    'display_title': '"sustainable development"',
    'rows': 20,
    'page': 1
}

metadata_list = []

for i in range(1):
    response = requests.get(url, params=params)
    data = json.loads(response.content)
    for doc_id in data['documents']:
        metadata = data['documents'][doc_id]
        metadata_list.append(metadata)
        
    params['page'] += 1
    
df = pd.DataFrame(metadata_list)

Now we have our metadata saved as a pandas dataframe (df).

2. Build an ontology using the metadata from the documents

Now we can set up our ontology. What even is an ontology?

“An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them,” (Noy and McGuinness, 2001).

By adopting an ontology, we gain the ability to connect diverse data sources. In this tutorial, our main focus lies in World Bank data, however, if we were to integrate UN data, we would face challenges such as varying document type categories, date formats, and country names. An ontology serves as a valuable tool in bridging these differences and establishing connections between the datasets. Moreover, we can expand the scope of our knowledge graph by incorporating data from Wikidata, a widely used public knowledge graph closely related to Wikipedia. Aligning our data ontology with Wikidata’s ontology enables seamless integration of information from Wikidata into our own knowledge graph.

The following code sets up our graph

from rdflib import Graph, RDF, RDFS, Namespace, URIRef, Literal
from rdflib.namespace import SKOS, XSD
from SPARQLWrapper import SPARQLWrapper, JSON
from tqdm import tqdm

# Create a new RDF graph
g = Graph()

schema = Namespace('http://schema.org/')
wd = Namespace('http://www.wikidata.org/entity/')

# Define namespaces
prefixes = {
    'schema': schema,
    'wd': wd,
    'skos': SKOS,
    'xsd': XSD
}
for p, ns in prefixes.items():
    g.bind(p, ns)

There is a column in the metadata called, “count”, which represents the country of origin for a given document. We want to use this column to create a ‘country’ class in our ontology, along with subclasses for each unique country in this column.

def create_subclass_country(column):
    newClass = URIRef(schema + "country")
    g.add((newClass, RDFS.label, Literal("country", lang='en')))
    df[column] = df[column].astype(str)
    for value in df[column].unique():
        if value != "nan":
            # Check Wikidata for a matching class
            sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
            query = f"""
                SELECT ?class ?label WHERE {{
                    ?class wdt:P31 wd:Q6256 .
                    ?class rdfs:label "{value}"@en .
                    OPTIONAL {{ ?class skos:prefLabel ?label FILTER(lang(?label) = "en") }}
                    FILTER(REGEX(STR(?class), "^http://www.wikidata.org/entity/Q[0-9]+$"))
                }}
            """
            sparql.setQuery(query)
            sparql.setReturnFormat(JSON)
            results = sparql.query().convert()

            # If there is a match, use the Wikidata class as a subclass
            if results['results']['bindings']:

                #Get URI from Wikidata
                uri = results['results']['bindings'][0]['class']['value']

                #Get the 'Q ID' which is the unique ID at the end of the URI
                qid = uri.split('/')[-1]
                country_label = value

                #Create a subclass for each country under the country class
                subclass = URIRef(schema + country_label.replace(' ', '_'))
                g.add((subclass, RDF.type, RDFS.Class))
                g.add((subclass, RDFS.subClassOf, newClass))

                # Update the "country_URI" column with the URI for the current country
                df.loc[df[column] == value, "country_URI"] = uri
                uri = URIRef(uri)

                # Define the URI for the new Wikidata URI property
                wd_URI_property = URIRef(schema + "wd_URI")

                # Add the property to the RDF graph
                g.add((wd_URI_property, RDF.type, RDF.Property))

                # Add a label to the property
                label = Literal("Wikidata URI", lang="en")
                g.add((wd_URI_property, RDFS.label, label))

                #Add Wikidata URI as a property to each country class
                g.add((subclass, schema.wd_URI, uri))

                #Add label to each Wikidata Q ID code that it is the Q ID for this particular country
                g.add((uri, RDFS.label, Literal(f"{country_label} wikidata code", lang='en')))
                g.add((subclass, RDFS.label, Literal(value, lang='en')))
            else:    
                subclass = URIRef(schema + value.replace(' ', '_').replace('-','_'))
                g.add((subclass, RDF.type, RDFS.Class))
                g.add((subclass, RDFS.subClassOf, newClass))
                g.add((subclass, RDFS.label, Literal(value, lang='en')))

I use protégé, a free, open-source ontology editor, to view and sometimes manually adjust my ontologies. If you want to view the ontology as you create it, just save the graph as a ttl file and open with protégé.

#Save graph as ttl file for use in protégé
g.serialize('worldBankKG.ttl',format='turtle',prefixes = prefixes, encoding='urf-8')

Once you open the file in protégé, it should look something like this:

Screenshot from Protégé. Image by author.

Each of the unique countries in the ‘count’ column of the World Bank data will have created a unique subclass in our ontology under the ‘country’ class. You can see that each country also has a label and a Wikidata URI. In this case, Argentina’s Wikidata URI is: https://www.wikidata.org/entity/Q414.

That is also the link to the Wikidata page for the country Argentina.

Now we have to create classes and subclasses for everything else. I made classes for document type, World Bank project, World Bank trustfund, country, and authors. I’m not putting all of that code in this tutorial, but see my notebook for the code to create all of it.

3. Populate the ontology with instances of documents

The ontology is the backbone of the KG, but now we need to populate it with data. The ontology so far defines classes, subclasses, properties of classes and subclasses, and the relations between them.

For example, we now have a class for World Bank documents and a subclass for working papers. Our ontology establishes that a working paper is a subclass of (or a type of) a World Bank document. Our ontology establishes labels and other properties for these entities. For example, working paper is an entity in Wikidata, and so we include the Wikidata URI in our ontology as a property.

Conceptualization of several entities in an ontology. Image by author.

In the example above, however, there are no actual World Bank working papers i.e. there are no instances of working papers. We have established the class, not the instances.

Here is the code to go through our DataFrame row by row, and for each row, create an instance of a document, and assign that instance appropriate properties. Note that to run this code you will need to have created all of these classes already (again, see the notebook for full code).

#Create abstract property
df['abstracts'] = df['abstracts'].astype(str).str.replace('\n', '').replace('\\n','')
abstractIs_uri = URIRef(schema + "abstractIs")
g.add((abstractIs_uri, RDF.type, RDF.Property))
g.add((abstractIs_uri, RDFS.label, Literal("Short summary of the document.")))

#Create abstract class
abstract_class = URIRef(schema + "abstract")
g.add((abstract_class, RDFS.label, Literal("Short summary of a document.")))

#Create author properties
authoredBy_uri = URIRef(schema + "authoredBy")
authored_uri = URIRef(schema + "authored")
g.add((authoredBy_uri, RDF.type, RDF.Property))
g.add((authored_uri, RDF.type, RDF.Property))
g.add((authoredBy_uri, RDFS.label, Literal("This document was authored by this author.")))
g.add((authored_uri, RDFS.label, Literal("This author wrote this document.")))

#Define 'part of' property
isPartOf_uri = URIRef(schema + "isPartOf")
g.add((isPartOf_uri, RDF.type, RDF.Property))
g.add((isPartOf_uri, RDFS.label, Literal("This entity is a part of another entity")))

#Define 'countryOfOrigin' property
countryOfOrigin_uri = URIRef(schema + "countryOfOrigin")
g.add((countryOfOrigin_uri, RDF.type, RDF.Property))
g.add((countryOfOrigin_uri, RDFS.label, Literal("Country that this document is about.")))

# Create instances for each document and add author property
for index, row in tqdm(df.iterrows()):
    if not pd.isnull(row['id']) and not pd.isnull(row['docty']) and not pd.isnull(row['authors']):
        try:
            # Create the report instance
            instance = URIRef(schema + "doc/" + str(row['display_title']).replace(" ","_").replace("-","_"))
            g.add((instance, RDFS.label, Literal(str(row['display_title']), lang='en')))
            
            #Connect instances with types of documents
            doctype = URIRef(row['docty'])
            g.add((instance, RDF.type, doctype))
    
            #Connect instances with country of origin
            if row['count'] != "nan":
                country = URIRef(schema + str(row['count']).replace(" ","_").replace("-","_"))
                g.add((instance, countryOfOrigin_uri, country))

            #Connect instances with projects
            if row['projn'] != "nan":
                project = URIRef(schema + "project/" + str(row['projn']).replace(" ","_").replace("-","_"))
                g.add((instance, isPartOf_uri, project))

            #Connect instances with trustfund_keys
            if row['trustfund'] != "nan":
                tf_values = row['trustfund'].split(",")
                for tf in tf_values:
                    trustfund_uri = URIRef(schema + "trustfund/" + str(tf).replace(" ","_").replace("-","_"))
                    g.add((trustfund_uri, RDFS.label, Literal(f"Trustfund: {tf}")))
                    g.add((instance, isPartOf_uri, trustfund_uri))
                    g.add((trustfund_uri, countryOfOrigin_uri, country))
                
            #Connect instances with authors
            author_dict = ast.literal_eval(row['authors'])
            for author_dict_entries in author_dict.values():
                author_name = author_dict_entries['author']
                author_uri = URIRef(schema + "author/" + author_name.replace(" ", "_"))
                g.add((instance, authoredBy_uri, author_uri))
                g.add((author_uri, authored_uri, instance))

            #Add abstract
            if row['abstracts'] != "nan":
                abstract_uri = URIRef(schema + "abstract/" + str(row['display_title']).replace(" ","_").replace("-","_"))
                g.add((instance, abstractIs_uri, abstract_uri))
                g.add((abstract_uri, RDFS.label, Literal(str(row['abstracts']))))
                g.add((abstract_uri, RDF.type, abstract_class))
                g.add((abstract_uri, isPartOf_uri, abstract_uri))
        except:
            pass

Now we have an actual knowledge graph. An instance of a working paper, with its associated properties, can be visualized the following way:

Conceptualization of ontology populated with some instances of data. Image by author.

You can save this new file and open in protégé to ensure that all of the entities have been incorporated appropriately.

4. Pull in additional entities and relationships into the graph using Wikidata

Because many of the entities in our KG have Wikidata URIs associated with them, we can import additional data from Wikidata into our KG. For this demo, I have only done this for countries. In the following code, we iterate over the country column, query Wikidata for the country entity, and import all properties and their values for the given country. Note that this code can take a while to run — it has to query Wikidata for all of the properties for each country, and their values, and put them all into our graph.

import numpy as np

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

# Create a cache to store property code-label mappings
property_cache = {}
entity_cache = {}

# Prepare a list to collect triples for bulk graph update
triples = []

# Iterate over the URIs and add the properties to the RDF graph
for uri in tqdm(df['country_URI']):        
    if isinstance(uri, str) and uri.startswith('http://www.wikidata.org/entity/Q'):
        class_uri = URIRef(uri)
        country_column = df.loc[df['country_URI'] == uri, 'count'].iloc[0]
        country_column = URIRef(schema + str(country_column).replace(" ", "_"))
        
        # Construct the SPARQL query
        qid = uri.split('/')[-1]
        query = f"""
        SELECT ?prop ?value WHERE {{
          wd:{qid} ?prop ?value .
          OPTIONAL {{ ?prop rdfs:label ?label . FILTER(lang(?label) = 'en') }}
        }}
        """

        # Set the query and response format
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)

        # Execute the query and retrieve the results
        results = sparql.query().convert()

        # Iterate over the results and add them to the RDF graph
        for result in results["results"]["bindings"]:
            prop = result["prop"]["value"]
            value = Literal(result["value"]["value"])
            triple = (country_column, None, None)  # Placeholder for triple
            
            if prop.startswith('http://www.wikidata.org/prop'):
                property_code = prop.split('/')[-1]
                # Check if the property code is already in the cache
                if property_code in property_cache:
                    property_label = property_cache[property_code]
                else:
                    # If not in cache, query and retrieve the property label
                    property_label = get_property_label(property_code)
                    # Store the property code-label mapping in the cache
                    property_cache[property_code] = property_label
                
                property_label_URI = URIRef(schema + property_label.replace(" ", "_"))
                triple = (country_column, property_label_URI, value)
            
            if value.startswith('http://www.wikidata.org/entity/Q'):
                entity_code = value.split('/')[-1]
                # Check if the entity code is already in the cache
                if entity_code in entity_cache:
                    entity_label = entity_cache[entity_code]
                else:
                    # If not in cache, query and retrieve the entity label
                    entity_label = get_entity_label(entity_code)
                    # Store the entity code-label mapping in the cache
                    entity_cache[entity_code] = entity_label

                entity_label_URI = URIRef(schema + str(entity_label).replace(" ", "_"))
                triple = (country_column, property_label_URI, entity_label_URI)
            
            triples.append(triple)
        
    elif isinstance(uri, float) and np.isnan(uri):
        continue
    else:
        continue

# Add all collected triples to the RDF graph in bulk
for subject, predicate, object_ in triples:
    if predicate is not None:
        g.add((subject, predicate, object_))

Now we have an ontology populated with data from the World Bank and additional data from Wikidata! If you open your KG in protégé you can explore all of the properties we have imported. Below is a screenshot of some of the properties we’ve imported for Argentina. All of this data can also be viewed directly on Wikidata here.

Wikidata has data like: form of government, head of state, diplomatic relations, life expectancy (and many other development indicators) over time, subregions/territories, and many more.

5. Query the KG directly using SPARQL

We can query this KG directly using SPARQL queries, the standard query language for RDF databases (what we have created). There is a SPARQL wrapper in Python that we can use.

For example, if want to see all documents about Brazil, we can run the following code:

# Step 1: Find the URI of Brazil in your ontology
brazil_uri = "<http://schema.org/Brazil>"  # Replace with the actual URI

# Step 2: Find the most relevant documents related to Brazil
documents_query = f"""
PREFIX schema: <http://schema.org/>
SELECT ?document
WHERE {{
  ?document a/rdfs:subClassOf* schema:world_bank_document ;
      schema:countryOfOrigin {brazil_uri} .

}}
"""
qres = g.query(documents_query)
for row in qres:
    print(f"Document ID: {row.document}")

This will output all World Bank documents in our KG that have ‘Brazil’ listed as the countryOfOrigin.

We can make SPARQL queries as complicated as we like and include any properties from either the World Bank metadata or the Wikidata that we imported. For example, what if we want to know which author has written the most documents about countries that are federal republics?

from rdflib import Graph, RDF, RDFS, URIRef

# Step 1: Find the URI of the basic_form_of_government you are interested in
government_form_uri = "<http://schema.org/federal_republic>"  # Replace with the actual URI

# Step 2: Query for authors who have written the most documents associated with countries having the basic_form_of_government as "federal_republic"
authors_query = f"""
PREFIX schema: <http://schema.org/>
PREFIX prop: <http://schema.org/property>
SELECT ?author (COUNT(?document) AS ?numDocuments)
WHERE {{
  ?document a/rdfs:subClassOf* schema:world_bank_document ;
            schema:countryOfOrigin [
                    schema:basic_form_of_government {government_form_uri}
            ] ;
            schema:authoredBy ?author .
}}
GROUP BY ?author
ORDER BY DESC(?numDocuments)
"""

# Execute the query
results = g.query(authors_query)

# Now you can process the results and present them as needed (e.g., using pandas DataFrames)
# For simplicity, here, I'm just printing the author names and the number of documents they wrote
for row in results:
    print(f"Author: {row.author}, Number of Documents: {row.numDocuments}")

The output will look something like this:

We can see that the author “World Bank” has written five documents about countries that Wikidata has listed as federal republics. In this case, the countries are Brazil, Comoros, and Ethiopia. This is significant because the World Bank data does not tell us the form of government for any country, that came from Wikidata. Since we built an ontology that aligns with Wikidata, we can incorporate additional data from Wikidata easily.

6. Compare the different ways of interacting with data

We know that we can use SPARQL to query the KG and get accurate results. But that requires that we write SPARQL queries, which requires some technical capabilities. Can we combine this accuracy with the ease of use of an LLM so that we can ‘chat’ directly with the data? LlamaIndex is a powerful tool that allows you to enhance an LLM using your own data, what they call context information. Here is a LlamaIndex tutorial to get started. I will use LlamaIndex to incorporate this World Bank data into an LLM in two ways: using the raw CSV file that we got directly from the World Bank API, and using the KG that we’ve built and populated using the World Bank data.

Use raw CSV file we got directly from World Bank API

As a starting point, let’s load the raw metadata from the World Bank into LlamaIndex. This will serve as a benchmark to which we can compare the results after incorporating the KG. The raw data that comes from the World Bank is in CSV format and does not have any ontology associated with it. Here is all the code you need to get LlamaIndex set up. You’ll need an OpenAI API key, which you can get from the OpenAI website. This code reads data directly from a local data folder, I have named my folder ‘data’. You can just put a csv file directly into this folder and LlamaIndex will index it.

from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, ServiceContext, LLMPredictor
from langchain import OpenAI
import os 
import openai

os.environ["OPENAI_API_KEY"] = <YOUR API KEY>  # replace with yours
openai.api_key = os.environ["OPENAI_API_KEY"]

documents = SimpleDirectoryReader('data').load_data()

query_engine = index.as_query_engine()

Now we can ask it some questions, using basic human English, just like we would ChatGPT.

response = query_engine.query("Show me all of the World Bank documents in the context information about Brazil")
print(response)

Here is the response:

Brazil — LATIN AMERICA AND CARIBBEAN — P126452 — Rio Grande do Norte: Regional Development and Governance — Audited Financial Statement Brazil — LATIN AMERICA AND CARIBBEAN- P158000- Amazon Sustainable Landscapes Project — Procurement Plan

These documents are, in fact, World Bank documents about Brazil that are in the context information. There are, however, many other documents that are not listed here.

Now let’s try to get all of the documents that Anna Corsi has written.

response = query_engine.query("Based on the context information, what documents has Corsi,Anna written?")
print(response)

The response is:

Corsi,Anna has not written any documents based on the context information.

This is incorrect. Anna Corsi is an author in the data.

Let’s try one more time:

response = query_engine.query("Tell me more about Anna Corsi")
print(response)

And the response is:

Anna Corsi is not mentioned in the given context information

Again, this is incorrect. Anna Corsi is one of the World Bank authors in our data.

Use the knowledge graph as input

Now the big question — does using a knowledge graph as our context information improve these results? Rather than load the RDF data into LlamaIndex the same way, we will use the RDFReader. I have found this method of loading data into LlamaIndex to be a bit more problematic and it takes longer, but it is recommended way of incorporating RDF data into your input prompt. RDFReader requires that every entity in the KG have a label. So even if you add a comment on an entity you also need to add a label to the comment.

from llama_index import GPTVectorStoreIndex, download_loader

RDFReader = download_loader("RDFReader")
document = RDFReader().load_data(file="path to your file")

# Define LLM
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-002"))

# NOTE: set a chunk size limit to < 1024 tokens 
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=1012)

index = GPTVectorStoreIndex.from_documents(document, service_context=service_context)

Note: that I am still using the GPTVectorStoreIndex for the RDF data, rather than the KnowledgeGraphIndex. I didn’t find that the KnowledgeGraphIndex produced any better results but it took a lot more time and was a lot more costly. Others seem to have had similar experiences. My understanding is the the RDFReader is for using a pre-made KG and the KGIndex is for creating a KG with LlamaIndex.

Once initialized, we can ask this LLM the same questions to see if the answers are any better. For the first prompt, “Show me all of the World Bank documents in the context information about Brazil,” the response is:

<Brazil - LATIN AMERICA AND CARIBBEAN - P107146 - Acre Social and Economic Inclusion and Sustainable Development Project - PROACRE - Audited Financial Statement>
<Disclosable Version of the ISR - Rio de Janeiro Adjustment and Sustainable Development Policy Loan - P178729 - Sequence No : 01>
<Grosso Fiscal Adjustment and Environmental Sustainability Development Policy Loan>
<Disclosable Version of the ISR - Matanza-Riachuelo Basin Sustainable Development Project - P105680 - Sequence No : 29>
<Disclosable Version of the ISR - Matanza-Riachuelo Basin Sustainable Development Project - P105680 - Sequence No : 30>
<Disclosable Restructuring Paper - Health Sustainable Development Goals Program-for-Results - P123531>

This is partially correct. Two of these entities are documents about Brazil, but two are about Argentina, and one is about Ethiopia. There are also additional documents about Brazil that are not listed here.

Now let’s ask about a specific author again:

response = query_engine.query("Based on the context information, what documents has Corsi,Anna written?")
print(response)

The response is:

Concept Project Information Document (PID) - Land administration infrastructure for green and sustainable development - P179217

This is partially correct. Anna Corsi did write this document, but she wrote three other documents in the context information that are not listed here.

response = query_engine.query("Tell me more about Anna Corsi")
print(response)

The response is:

‘Anna Corsi is an author of the document “Concept Project Information Document (PID) — Land administration infrastructure for green and sustainable development — P179217”. This document is about a project to support the development of a mass valuation system in Turkey and generate market values for individual property units. Corsi is also the author of “The Time is Now : How Can Uzbekistan Leverage Urbanization as a Driver of Sustainable Development?”, a document about how Uzbekistan can use urbanization to promote sustainable development.’

This is partially correct. The first part, about the land administration project in Turkey, is correct. Anna Corsi was not, however, an author of the Uzbekistan report.

The LLM clearly struggles with specific structured data questions. But, as we said at the beginning, LLMs excel at understanding and generating human-like text, capturing intricate language patterns, and producing contextually relevant responses. Here is an example of how we can use the strengths of the LLM on this dataset. Suppose we already know that the World Bank has a land management infrastructure project in Turkey. We can ask the LLM to describe the project to us using the following prompt:

response = query_engine.query("Tell me more about World Bank's land management infrastructure project in Turkey")
print(response)

The response is:

The objective of the Land Management Infrastructure for Green and Sustainable Development Project is to improve the accuracy and accessibility of land administration information in Turkiye. There are three components to the project, the first component being creating 3D city models and updating cadastre data. This component will support: (i) the creation of 3D city models based on the proven approach tested in the Amasya pilot; and (ii) the completion of the update and verification of cadastral data for 6 million parcels (out of the total remaining 11 million parcels19 not covered by LRCMP), in both urban and rural areas. As part of the cadastre updating activities, capacity building programs for addressing challenges concerning women’s land rights and ownership will be discussed with TKGM to determine how to better address these issues as part of the public consultation step during the surveying process. While activities on the update and verification of cadastral data will be carried out in both urban and rural areas, for the creation of 3D city models, the Project will cover major urban areas (approximately 40,000 km, almost all urban areas in the country) in all 81 provinces in Turkiye. As part of these activities, the project will finance the completion and renewal

This is an accurate summary of the document in our database. This information largely comes from the ‘abstract’ field in the data.

Conclusion

In conclusion, incorporating structured data directly into LLMs through in-context learning yields suboptimal results, and challenges such as occasional hallucinations and lack of transparency persist. While transforming structured data into knowledge graphs improves LLM performance to some extent, formal query languages like SPARQL outperform LLMs in specific queries across structured datasets. However, LLMs can assist in formulating SPARQL queries, providing a way for users to ‘chat’ with structured data without writing formal queries.

LLMs have demonstrated their prowess in understanding and interpreting unstructured data, allowing them to provide relevant responses. This capability extends even to structured data when unstructured information is embedded within it. For instance, if the structured data includes a column labeled ‘abstracts’ containing unstructured text, the LLM can leverage that data to generate insightful results.

Alternative synergies between LLMs and knowledge graphs, such as using LLMs to translate prompts into formal queries or leveraging knowledge graphs for LLM validation, offer potential avenues for exploration and advancement in the field.