The Silent Information Transformation with LLMs: Building Knowledge Graphs with Maximum Automation

As AI permeates across industries, a silent transformation is underway in how AI systems are built.

Recent advances in LLMs have exposed their limitations around reasoning and grounding.

The exponential growth of data has brought both opportunities and challenges. While we create over 2.5 quintillion bytes of data daily, extracting value from this data remains difficult.

This is where knowledge graphs can play a transformational role — by structuring information to reveal insights.

However, building knowledge graphs manually is cumbersome. This article outlines how recent advances in large language models (LLMs) enable automating major portions of the knowledge graph construction process from unstructured data.

These trends underscore the need for structured knowledge in delivering accurate, safe and responsible AI applications.

Knowledge graphs fill this gap by encoding concepts and relationships in machine-readable ontologies to drive reasoning and semantics.

However, constructing massive enterprise knowledge graphs manually does not scale. This is where LLMs come in — by automating ontology extraction from unstructured corpora.

We detail a scalable methodology to generate high-quality knowledge graphs leveraging LLMs. But knowledge alone is not sufficient.

Emergent techniques like graph neural networks, node2vec and GraphSAGE operationalize knowledge graphs by learning powerful embeddings.

By coupling graph embeddings powered retrieval with LLMs in a RAG framework, requests are guaranteed to be grounded in truth. Any hallucinated claims can be automatically flagged through verification queries on the knowledge graph.

Why RAG (Retrieval Augmented Generation) will become a cornerstone of system design using LLM…

Recent advances in large language models (LLMs) like GPT-3 [1] have demonstrated powerful few-shot learning abilities —…

ai.plainenglish.io

Beyond improved accuracy, knowledge graphs also equip LLMs with commonsense reasoning and multi-step inference. This holds the potential to elevate RAG systems from narrow AI to more general machine intelligence.

Together, Knowledge Graphs and Embeddings will form the bedrock of next-gen AI by grounding language models, enabling reasoning and semantics. They provide the infrastructure for machine knowledge representation — much like databases did for software systems.

Vector Search Is Not All You Need

Introduction

towardsdatascience.com

Mastering this stack will become pivotal engineering expertise as AI becomes ubiquitous.

It unlocks enormous potential for performant and responsible AI across domains like healthcare, finance, science and sustainability.

The methodologies discussed form the blueprint for knowledge-driven AI.

LLMs and KG construction :

Zhu et al 2023 evaluates prominent LLMs such as ChatGPT and GPT-4 on various KG construction tasks like entity extraction, relation extraction, event extraction, and entity linking.

The performance of LLMs on KG construction tasks does not yet surpass specialized fine-tuned models developed specifically for these tasks. This is likely because KG construction involves multiple complex subtasks like named entity recognition and relation extraction :

Relation Extraction (RE)
Event Detection (ED)
Link Prediction (LP)
Question Answering (QA)

However, LLMs show promising capability improvements in one-shot and zero-shot settings over previous benchmarks. For example, GPT-4 demonstrates enhanced performance over ChatGPT in extracting relations from complex sentences.

There are several factors that currently limit the capabilities of LLMs for KG construction tasks, including dataset noise, semantic ambiguity, lack of type specifications, and prompt engineering.

The study introduces the concept of “Virtual Knowledge Extraction” to discern if LLM performance stems from memorized knowledge or true generalization ability. Initial findings indicate models like GPT-4 can rapidly acquire new extraction capabilities when provided sample instructions.

There are significant opportunities to enhance LLM capabilities for KG construction by optimizing prompt design, drawing on instructive examples for one-shot learning, accounting for dataset limitations, and leveraging human feedback.

While LLMs show promise for KG construction their capabilities do not yet exceed specialized systems. However, their ability to generalize provides optimism for future advancement in assisted and semi-automated KG construction pipelines.

Step 1: Set Up the LLM Pipeline

Open-source models like Zephyr 7B can be leveraged instead of commercial API-based models like GPT-3. Zephyr 7B is compatible with the Transformers library, allowing it to be used in a scalable pipeline.

For example:

# Install transformers from source - only needed for versions <= v4.34
# pip install git+https://github.com/huggingface/transformers.git
# pip install accelerate

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-alpha", torch_dtype=torch.bfloat16, device_map="auto")

The LLama index combined with Neo4j functionality allows combining vector embeddings with graph data by indexing documents into a searchable store. The KnowledgeGraphIndex builder extracts triplets from text while storing embeddings:

from Llamaindex import LLMPredictor, KnowledgeGraphIndex

documents = SimpleDirectoryReader(
    r"GraphCreation"
).load_data()

index = KnowledgeGraphIndex.from_documents(
    documents, 
    query_keyword_extract_template=qa_template,
    storage_context=storage, 
    max_triplets_per_chunk=20,
    service_context=service,
    include_embeddings=True
)

This sets up an end-to-end pipeline leveraging Zephyr for text processing, with fine-tuning for information extraction, and LLama indexing for embedding powered search over the graph data.

The pipeline focuses on scalability via the Transformers and Llama inex libraries. Embeddings and incremental updates can enhance analysis.

Embeddings + Knowledge Graphs: The Ultimate Tools for RAG Systems

The advent of large language models (LLMs) , trained on vast amounts of text data, has been one of the most significant…

towardsdatascience.com

II. Create Structured Extraction Guidelines

Step 1: Define Entity Types

Clearly specify the types of entities the LLM needs to extract (people, organizations, locations, medical conditions, etc.) along with definitions and examples of each entity type.

Step 2: Specify Relationships/Properties

Define the relationships and properties the LLM should identify between entities, providing schema examples. E.g. “employed_at”, “diagnosed_with”, “capital_of”

Step 3: Set Rules for Entity Resolution

Provide rules for how entities with similar surface forms should be resolved to canonical entities. Explain how to handle abbreviations, synonyms, alternate spellings, etc.

Step 4: Outline Numerical Data Handling

Instruct how to interpret and normalize numerical data — percentages, currency, measurements, etc. Specify any units that should be captured.

Step 5: Set Expectations for Output Structure

Show examples of the expected output structure — whether an ontology language like OWL, a graph query language like Cypher, or structured JSON.

Step 6: Include Corner Case Examples

Supplement guidelines with examples of complex inputs and how the LLM should handle them. Focus on areas humans also find tricky.

Step 7: Iterate on Real-World Cases

Test guidelines on actual samples from the target domain and update accordingly. The end guidelines should encode human logic.

By methodically building up these guidelines with concrete examples tailored to the application, we can tune LLMs to eliminate inconsistencies and errors in information extraction. This level of structure is essential for mapping unstructured data into clean knowledge graphs.

Example :

The overall goal is to define a template that instructs the language model how to extract ontology elements like classes, instances, and properties from text and output them in a structured format.

It starts by allowing the configuration of allowed node (entity) types and allowed relationship types:

doc = {
  'allowed_nodes': [], 
  'allowed_rels': []
}
allowed_nodes = doc.get('allowed_nodes', []) 
allowed_rels = doc.get('allowed_rels', [])

This allows constraining the output schema to fixed sets of classes and relationships.

The template content explains the key elements to extract:

Classes — Categories of entities
Instances — Individual entities belonging to classes
Properties — Attributes of and relationships between entities

It provides guidelines enforced by the prompt:

Use consistent class labels
Output human-readable instance IDs
Handle numerical data and dates as properties
Resolve coreferences
Infer new knowledge from existing extraction
Strict compliance with guidelines

The allowed nodes and relationships are inserted dynamically based on the config.

Finally, the template is wrapped into a Prompt object from LangChain for easy reuse:

qa_template = Prompt(template)

This template can now be utilized in Llamaindex pipelines to instruct models like LLMs to output ontology extractions in a structured format according to predefined guidelines.

from llama_index import Prompt

doc = {
    'allowed_nodes': [],
    'allowed_rels': []
}

allowed_nodes = doc.get('allowed_nodes', [])  # Replace with your list of allowed nodes
allowed_rels = doc.get('allowed_rels', [])  # Replace with your list of allowed relationships

template = (f"""
    You are a top-tier algorithm designed for extracting information in structured formats to build an ontology.
    - **Classes** represent categories or types of entities. They're akin to Wikipedia categories.
    - **Instances** are the individual entities belonging to these classes.
    - **Properties** define attributes and relationships between instances and classes.
    
    Here are the guidelines you should follow:

    ## 1. Identifying and Categorizing Entities
    - **Consistency**: Use consistent, basic types for class labels.
    - **Instance IDs**: Use names or human-readable identifiers found in the text as instance IDs. Avoid using integers.
    - **Allowed Class Labels:** {", ".join(allowed_nodes)}
    - **Allowed Property Types:** {", ".join(allowed_rels)}

    ## 2. Defining Hierarchical Relationships
    - Identify superclass-subclass relationships between classes. For example, if you identify "bird" and "sparrow", establish that "sparrow" is a subclass of "bird".

    ## 3. Handling Numerical Data and Dates
    - Numerical data should be incorporated as properties of the respective instances or classes.
    - Do not create separate classes for dates or numerical values. Always attach them as properties.

    ## 4. Coreference Resolution
    - Maintain consistency when extracting entities. If an entity is referred to by different names or pronouns, always use the most complete identifier.

    ## 5. Inferring New Knowledge
    - Use the existing information in the ontology to infer new knowledge. For example, if "sparrows" are a type of "bird", and "birds" can "fly", infer that "sparrows" can "fly".

    ## 6. Strict Compliance
    - Adhere to these rules strictly. Non-compliance will result in termination.
    """)
qa_template = Prompt(template)

Instead, it relies completely on in-context learning, where the model is prompted to perform named entity recognition on a given input text by providing it a few examples of the task. So the parameters of GPT-3 itself remain fixed.

GPT-NER :

GPT-NER: Named Entity Recognition via Large Language Models

Despite the fact that large-scale Language Models (LLM) have achieved SOTA performances on a variety of NLP tasks, its…

arxiv.org

The key aspects of the approach are:

Transforming NER into a text generation problem by having the model output special tokens around entities.
Carefully designing the prompt with a task description, demonstrations, etc. to guide the model.
Using an entity-level embedding technique to retrieve high-quality demonstration examples that are relevant to the input text.
Adding a self-verification module to mitigate against GPT-3’s tendency to hallucinate non-existent entities.

GPT-NER shows how large pre-trained language models like GPT-3 can achieve strong performance on downstream tasks through prompt and demonstration design, without any fine-tuning or parameter updates. The authors leverage GPT-3’s capabilities as a few-shot learner here.

III. Tune the Model with Quality Datasets (optional)

Step 1: Create Labelled Training Data

Sample representative text from the target domain (e.g. medical journals)
Manually annotate entities, relationships, classes in the samples
Create input-output pairs for LLM fine-tuning

For example:

Input: Penicillin is an antibiotic used to treat bacterial infections. It was discovered by Alexander Fleming.
Output:
{
  "Instances": [
    {"id": "penicillin", "type": "antibiotic"},  
    {"id": "Alexander_Fleming", "type": "person"}
  ],
  "Relationships": [
    {"from": "penicillin", "type": "used_to_treat", "to": "bacterial infections"}
    {"from": "penicillin", "type": "discovered_by", "to": "Alexander_Fleming"} 
  ]
}

Step 2: Train and Evaluate Models

Fine-tune language model (e.g. Zephyr) on labelled dataset
Validate model accuracy on separate test set
Plot metrics like precision, recall over time
Compare models trained with different hyperparameters
Select best performing model

Step 3: Iterative Improvement

Identify error categories (e.g. parse errors, missed entities etc.)
Expand training data with more examples fixing errors
Retrain model and confirm metrics improve
Repeat process until accuracy plateaus

This will tune the LLM to accurately extract key ontology components tailored to the domain, enabling high-quality knowledge graph construction.

IV. Extract Entities and Relationships

Step 1: Text Chunking

Split input documents into small chunks of ~500 words
Balance chunk size to provide context while ensuring entity resolution

chunker = TextChunker(chunk_size=500)
chunks = chunker.chunk(documents)

Step 2: Ontology Extraction per Chunk

Input each chunk text into fine-tuned LLM
LLM executes extraction prompt template
Outputs JSON with entities, relationships, classes

kg_extractor = Pipeline(model=zephyr, tokenizer=tokenizer, template=extraction_template)

for chunk in chunks:
   output = kg_extractor(chunk)

Step 3: Post-process Extractions

Normalize entity names, classes, IDs for consistency
Attach metadata like source chunk ID

def process(extractions):
   # normalize, attach metadata etc  
   return processed_extractions

outputs = [process(o) for o in outputs]

Step 4: Quality Checking

Spot check extractions

Image by the author : example with a Neo4j graph

Identify systematic errors through small sample audits
Retrain LLM on errors if needed

Entity Disambiguation

Since text chunks are processed independently, the same real-world entity can be extracted with slight variations across chunks. This causes multiple nodes representing the same entity.

Some solutions:

Entity linking models
Second LLM pass focused on disambiguation
Graph algorithms to cluster duplicate nodes

This is a critical step for accuracy. Without it, downstream applications on the knowledge graph become less effective.

import openai
from tqdm import tqdm
# Improved disambiguation prompt
disambiguation_prompt = """
You are an AI model specialized in entity disambiguation. Your task is to identify which values reference the same entity.
For instance, if I provide you with the following entities:
- Apple Inc.
- Apple
- Microsoft
You should return:
- Apple Inc., 1
- Apple, 1
- Microsoft, 2
This indicates that 'Apple Inc.' and 'Apple' refer to the same entity (1), while 'Microsoft' refers to a different entity (2).
Now, process the following entities:
"""
from openai import OpenAI
def disambiguate(entities):
client = OpenAI(api_key='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') # replace 'your-api-key' with your actual OpenAI API key
completion = client.chat.completions.create(
model="gpt-3.5-turbo-16k-0613",
messages=[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": disambiguation_prompt + "\n".join(entities)
}
]
)
return completion.choices[0].message.content
# Get all node names from the updated graph
driver = GraphDatabase.driver(url, auth=(username, password))
with driver.session() as session:
result = session.run("MATCH (n) RETURN n.id AS name")
node_names_updated = [record['name'] for record in result]
# Disambiguate the node names
disambiguation_results_updated = []
for name in tqdm(node_names_updated, desc="Disambiguating entities"):
disambiguation_id = disambiguate(name)
disambiguation_results_updated.append((name, disambiguation_id))
# Print the disambiguation results
for result in disambiguation_results_updated:
print(result)
# Update the graph with the disambiguation results
for name, disambiguation_id in disambiguation_results_updated:
escaped_name = name.replace("'", "\\'") # Escape single quotes
query = f"MATCH (n) WHERE n.id = '{escaped_name}' SET n.disambiguation_id = '{disambiguation_id}'"
print(query)
with driver.session() as session:
session.run(query)

Step 5: Consolidate Chunk Graphs

Maintain mappings of coreferring entities across chunks
Resolve to canonical entities

This pipeline robustly processes large volumes of text, leveraging the fine-tuned LLM to extract ontology triplets from each chunk, while ensuring quality.

V. Build the Final Knowledge Graph

Step 1: Node and Edge Consolidation

Maintain a mapping of canonical node IDs
Replace any duplicate node IDs across chunks with canonical ID
Consolidate duplicate edges into a single edge
Sum edge weights when consolidating

This eliminates duplicates and creates a unified node and edge set.

def consolidate(graphs):
    nodes, edges = {}, {}
    
    for g in graphs:
        map_nodes(g, nodes)
        map_edges(g, edges)
return build_graph(nodes, edges)

Step 2: Attach Metadata

Store source document ID, chunk ID, etc as node/edge properties
Timestamp extraction date, model version etc.
Enables tracking provenance

def tag_metadata(graph):
    for node in graph.nodes:
        node['extracted_on'] = datetime
        node['source_doc'] = doc_id
    
    for edge in graph.edges:
       edge['model_version'] = 2.1

Step 3: Graph Analysis

Community detection to find related node clusters
Centrality measures like PageRank to ID important nodes
Visualize communities and centralities

communities = community_detection(graph) 
pageranks = pagerank(graph)
visualize(graph, communities, pageranks)

Step 4: Knowledge Graph Store

Bulk load consolidated property graph into Neo4j
Optimized for complex analytics queries
Scales to enterprise data

This brings together extracted knowledge from all documents into an integrated, analytics-ready knowledge graph with metadata tracking and community topological analysis.

VI. Benefits Over Manual Approaches

The methodology automates ontology development from unstructured data at scale. It saves thousands of human hours otherwise required for manual reading, labeling and linking. Being programmatic enables easier debugging, metrics-driven iteration and integration of emerging LLMs. The final knowledge graph is primed for analytics.

High Volume Scalability

LLMs process thousands of documents per hour
Scales linearly with compute resources
Human review bandwidth is bottleneck for manual labeling

Rapid Iteration and Improvement

Model correctness metrics guide LLM retraining
Frequent model updates possible, unlike manual process
Enables leveraging latest LLMs like GPT-4

Easier Debugging

Granular extraction logs point to error sources
Fixing requires prompt and dataset changes
Hard to diagnose human errors

Structured Analytics Enablement

Query explicitly defined schema and properties
Knowledge inference through semantic queries
Ad-hoc queries not feasible with unstructured data

Consistency and Reuse

Precise guidelines maximize ontology consistency
Extraction reproducibility enables incremental addition
Manual process prone to semantic drift

The LLM automation provides order-of-magnitude efficiency gains over manual labeling, while knowledge graph integration and queryability unlocks previously inaccessible insights.

The rapid iteration, metric-driven improvements and emerging LLM integration further differentiates this methodology from traditional ontology engineering.

VII. Use Cases and Limitations

LLM-based knowledge graph construction shines for large document corpora where manual labeling is infeasible.

Applications include medical research, legal contracts, customer engagement systems and financial reports. However, output quality hinges on training data and guidelines.

Using LLMs as black boxes can propagate unintended biases. A hybrid human-LM approach with checks and balances is prudent.

Looking Ahead As LLMs become robust at multimodal information extraction spanning text, images, speech and video, the methodology will grow more versatile.

Immediate opportunities lie in iterative knowledge graph construction — where existing graphs are automatically updated as new documents are ingested.

Such self-supervised Lifelong learning approaches will usher the next evolution in extracting intelligence from exponentially growing information.

Bonus : AutoKG, a multiagent framework

AutoKG which utilizes multiple conversational agents to automatically construct and reason about knowledge graphs.

GitHub - zjunlp/AutoKG: Code and dataset for the paper "LLMs for Knowledge Graph Construction and…

Code and dataset for the paper "LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future…

github.com

Main Function

Sets up the assistant and user roles, defines a sample task prompt, and specifies the conversation flow using the starting_convo and get_sys_msgs functions.

CAMELAgent Class

Defines an agent that can participate in conversations by maintaining context, stepping through messages, and querying an underlying chat model (ChatGPT) to generate responses.

Starting Conversation

starting_convo initiates the conversation by asking the assistant agent to specify the initial task.
get_sys_msgs then sets up the initial system messages that define the role, task, and conventions for the assistant and user agents.

Conversation Loop

The main loop manages a conversation between the assistant and user agents, stepping each one to generate responses.
The assistant agent has the ability to retrieve supplementary information from search to aid in responding.

Retrieval Agent

Retrieval_Msg defines an agent that can specify a focused search query based on the current conversation and task.
It then retrieves summarizations and facts from the search results to enhance the assistant agent’s knowledge.

AutoKG employs multiple conversational agents with differentiated capabilities (conversation, retrieval etc) to automate the process of constructing and reasoning about knowledge graphs via an assisted dialogue. The system manages role assignments, task prompting, information retrieval, and response generation to complete the specified knowledge graph construction task.

task_specifier_prompt:

This prompt is used by the starting_convo function to ask the assistant agent to specify the initial task in more detail.
It instructs the assistant to restate the task, while being creative and imaginative, in a constrained number of words.

assistant_inception_prompt:

This prompt sets up the initial system message for the assistant agent, defining its role, relationship with the user agent, guidelines for solving the task through a collaborative dialogue.

user_inception_prompt:

Similarly, this prompt defines the initial system message for the user agent, explaining its role as the instructor providing guidelines and inputs to help the assistant agent complete the task.

retrieval_specifier_prompt:

This prompt is used to ask the assistant agent to formulate a focused search query based on the current task and conversation, in order to retrieve supplementary information from the web.

HumanMessagePromptTemplate:

This is a template class used to parameterize and format prompt templates to generate HumanMessage instances, which contain the formatted message text.

SystemMessagePromptTemplate:

Similarly, a template to generate SystemMessage instances which contain the initial instructions for the agents.

Main prompt logic from AutoKG (Zhu et al. 2023).

AutoKG uses specialized prompts for task specification, role inception, and information retrieval by assistant and user agents. Template classes help parameterize and generate these prompts to set up the conversational flow. The prompts provide precise instructions to agents on how to collaborate to construct knowledge graphs.

PlainEnglish.io 🚀

Thank you for being a part of the In Plain English community! Before you go:

Be sure to clap and follow the writer️
Learn how you can also write for In Plain English️
Follow us: X | LinkedIn | YouTube | Discord | Newsletter
Visit our other platforms: Stackademic | CoFeed | Venture