avatarAnthony Alcaraz

Summary

The provided content discusses the advancements and methodologies in leveraging large language models (LLMs) for the automated construction of knowledge graphs, emphasizing the integration of structured knowledge to enhance AI applications.

Abstract

The text outlines a transformative approach in AI, utilizing recent advances in large language models (LLMs) to automate the creation of knowledge graphs from unstructured data. It highlights the challenges of manual knowledge graph construction and the potential of LLMs to extract structured information, enabling more accurate, safe, and responsible AI systems. The article details a scalable methodology for generating high-quality knowledge graphs using LLMs, the importance of knowledge graphs in encoding concepts and relationships for AI reasoning and semantics, and the role of emergent techniques like graph neural networks in operationalizing these knowledge graphs. It also discusses the performance of LLMs in comparison to specialized models for knowledge graph construction tasks, the need for structured extraction guidelines, and the process of fine-tuning LLMs with quality datasets for domain-specific ontology extraction. The text concludes by emphasizing the benefits of LLM-based knowledge graph construction over manual approaches, including scalability, rapid iteration, easier debugging, and the enablement of structured analytics.

Opinions

  • The silent transformation in AI is driven by the integration of LLMs and knowledge graphs, which is essential for delivering accurate and responsible AI applications.
  • Knowledge graphs are transformational in structuring information to reveal insights, but manual construction does not scale effectively for enterprise needs.
  • LLMs, such as GPT-3 and GPT-4, have shown improvements in information extraction tasks but still have limitations compared to specialized models.
  • The construction of massive knowledge graphs can be significantly automated using LLMs, which can be further enhanced through techniques like graph embeddings and graph neural networks.
  • The combination of knowledge graphs and embeddings will form the bedrock of next-generation AI, providing the infrastructure for machine knowledge representation.
  • Mastering the stack of technologies involving LLMs, knowledge graphs, and embeddings is pivotal for engineering expertise as AI becomes ubiquitous.
  • The methodologies discussed in the article form the blueprint for knowledge-driven AI and have the potential to unlock enormous potential across various domains.
  • Despite the promise of LLMs for knowledge graph construction, their capabilities do not yet exceed those of specialized systems, but their generalization ability provides optimism for future advancements.
  • The use of LLMs for knowledge graph construction offers order-of-magnitude efficiency gains over manual labeling and enables more versatile and analytics-ready knowledge graphs.
  • A hybrid human-LM approach with checks and balances is recommended to mitigate the propagation of unintended biases inherent in LLMs.
  • The future of LLMs in knowledge graph construction includes the possibility of self-supervised lifelong learning approaches as models become more robust at multimodal information extraction.

The Silent Information Transformation with LLMs: Building Knowledge Graphs with Maximum Automation

As AI permeates across industries, a silent transformation is underway in how AI systems are built.

Recent advances in LLMs have exposed their limitations around reasoning and grounding.

The exponential growth of data has brought both opportunities and challenges. While we create over 2.5 quintillion bytes of data daily, extracting value from this data remains difficult.

This is where knowledge graphs can play a transformational role — by structuring information to reveal insights.

However, building knowledge graphs manually is cumbersome. This article outlines how recent advances in large language models (LLMs) enable automating major portions of the knowledge graph construction process from unstructured data.

These trends underscore the need for structured knowledge in delivering accurate, safe and responsible AI applications.

Knowledge graphs fill this gap by encoding concepts and relationships in machine-readable ontologies to drive reasoning and semantics.

However, constructing massive enterprise knowledge graphs manually does not scale. This is where LLMs come in — by automating ontology extraction from unstructured corpora.

We detail a scalable methodology to generate high-quality knowledge graphs leveraging LLMs. But knowledge alone is not sufficient.

Emergent techniques like graph neural networks, node2vec and GraphSAGE operationalize knowledge graphs by learning powerful embeddings.

By coupling graph embeddings powered retrieval with LLMs in a RAG framework, requests are guaranteed to be grounded in truth. Any hallucinated claims can be automatically flagged through verification queries on the knowledge graph.

Beyond improved accuracy, knowledge graphs also equip LLMs with commonsense reasoning and multi-step inference. This holds the potential to elevate RAG systems from narrow AI to more general machine intelligence.

Together, Knowledge Graphs and Embeddings will form the bedrock of next-gen AI by grounding language models, enabling reasoning and semantics. They provide the infrastructure for machine knowledge representation — much like databases did for software systems.

Mastering this stack will become pivotal engineering expertise as AI becomes ubiquitous.

It unlocks enormous potential for performant and responsible AI across domains like healthcare, finance, science and sustainability.

The methodologies discussed form the blueprint for knowledge-driven AI.

LLMs and KG construction :

(Zhu et al. 2023)

Zhu et al 2023 evaluates prominent LLMs such as ChatGPT and GPT-4 on various KG construction tasks like entity extraction, relation extraction, event extraction, and entity linking.

The performance of LLMs on KG construction tasks does not yet surpass specialized fine-tuned models developed specifically for these tasks. This is likely because KG construction involves multiple complex subtasks like named entity recognition and relation extraction :

  • Relation Extraction (RE)
  • Event Detection (ED)
  • Link Prediction (LP)
  • Question Answering (QA)

However, LLMs show promising capability improvements in one-shot and zero-shot settings over previous benchmarks. For example, GPT-4 demonstrates enhanced performance over ChatGPT in extracting relations from complex sentences.

There are several factors that currently limit the capabilities of LLMs for KG construction tasks, including dataset noise, semantic ambiguity, lack of type specifications, and prompt engineering.

The study introduces the concept of “Virtual Knowledge Extraction” to discern if LLM performance stems from memorized knowledge or true generalization ability. Initial findings indicate models like GPT-4 can rapidly acquire new extraction capabilities when provided sample instructions.

There are significant opportunities to enhance LLM capabilities for KG construction by optimizing prompt design, drawing on instructive examples for one-shot learning, accounting for dataset limitations, and leveraging human feedback.

While LLMs show promise for KG construction their capabilities do not yet exceed specialized systems. However, their ability to generalize provides optimism for future advancement in assisted and semi-automated KG construction pipelines.

Step 1: Set Up the LLM Pipeline

Open-source models like Zephyr 7B can be leveraged instead of commercial API-based models like GPT-3. Zephyr 7B is compatible with the Transformers library, allowing it to be used in a scalable pipeline.

For example:

# Install transformers from source - only needed for versions <= v4.34
# pip install git+https://github.com/huggingface/transformers.git
# pip install accelerate

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-alpha", torch_dtype=torch.bfloat16, device_map="auto")

The LLama index combined with Neo4j functionality allows combining vector embeddings with graph data by indexing documents into a searchable store. The KnowledgeGraphIndex builder extracts triplets from text while storing embeddings:

from Llamaindex import LLMPredictor, KnowledgeGraphIndex

documents = SimpleDirectoryReader(
    r"GraphCreation"
).load_data()

index = KnowledgeGraphIndex.from_documents(
    documents, 
    query_keyword_extract_template=qa_template,
    storage_context=storage, 
    max_triplets_per_chunk=20,
    service_context=service,
    include_embeddings=True
)

This sets up an end-to-end pipeline leveraging Zephyr for text processing, with fine-tuning for information extraction, and LLama indexing for embedding powered search over the graph data.

The pipeline focuses on scalability via the Transformers and Llama inex libraries. Embeddings and incremental updates can enhance analysis.

II. Create Structured Extraction Guidelines

Step 1: Define Entity Types

Clearly specify the types of entities the LLM needs to extract (people, organizations, locations, medical conditions, etc.) along with definitions and examples of each entity type.

Step 2: Specify Relationships/Properties

Define the relationships and properties the LLM should identify between entities, providing schema examples. E.g. “employed_at”, “diagnosed_with”, “capital_of”

Step 3: Set Rules for Entity Resolution

Provide rules for how entities with similar surface forms should be resolved to canonical entities. Explain how to handle abbreviations, synonyms, alternate spellings, etc.

Step 4: Outline Numerical Data Handling

Instruct how to interpret and normalize numerical data — percentages, currency, measurements, etc. Specify any units that should be captured.

Step 5: Set Expectations for Output Structure

Show examples of the expected output structure — whether an ontology language like OWL, a graph query language like Cypher, or structured JSON.

Step 6: Include Corner Case Examples

Supplement guidelines with examples of complex inputs and how the LLM should handle them. Focus on areas humans also find tricky.

Step 7: Iterate on Real-World Cases

Test guidelines on actual samples from the target domain and update accordingly. The end guidelines should encode human logic.

By methodically building up these guidelines with concrete examples tailored to the application, we can tune LLMs to eliminate inconsistencies and errors in information extraction. This level of structure is essential for mapping unstructured data into clean knowledge graphs.

Example :

The overall goal is to define a template that instructs the language model how to extract ontology elements like classes, instances, and properties from text and output them in a structured format.

It starts by allowing the configuration of allowed node (entity) types and allowed relationship types:

doc = {
  'allowed_nodes': [], 
  'allowed_rels': []
}
allowed_nodes = doc.get('allowed_nodes', []) 
allowed_rels = doc.get('allowed_rels', [])

This allows constraining the output schema to fixed sets of classes and relationships.

The template content explains the key elements to extract:

  • Classes — Categories of entities
  • Instances — Individual entities belonging to classes
  • Properties — Attributes of and relationships between entities

It provides guidelines enforced by the prompt:

  • Use consistent class labels
  • Output human-readable instance IDs
  • Handle numerical data and dates as properties
  • Resolve coreferences
  • Infer new knowledge from existing extraction
  • Strict compliance with guidelines

The allowed nodes and relationships are inserted dynamically based on the config.

Finally, the template is wrapped into a Prompt object from LangChain for easy reuse:

qa_template = Prompt(template)

This template can now be utilized in Llamaindex pipelines to instruct models like LLMs to output ontology extractions in a structured format according to predefined guidelines.

from llama_index import Prompt

doc = {
    'allowed_nodes': [],
    'allowed_rels': []
}

allowed_nodes = doc.get('allowed_nodes', [])  # Replace with your list of allowed nodes
allowed_rels = doc.get('allowed_rels', [])  # Replace with your list of allowed relationships

template = (f"""
    You are a top-tier algorithm designed for extracting information in structured formats to build an ontology.
    - **Classes** represent categories or types of entities. They're akin to Wikipedia categories.
    - **Instances** are the individual entities belonging to these classes.
    - **Properties** define attributes and relationships between instances and classes.
    
    Here are the guidelines you should follow:

    ## 1. Identifying and Categorizing Entities
    - **Consistency**: Use consistent, basic types for class labels.
    - **Instance IDs**: Use names or human-readable identifiers found in the text as instance IDs. Avoid using integers.
    - **Allowed Class Labels:** {", ".join(allowed_nodes)}
    - **Allowed Property Types:** {", ".join(allowed_rels)}

    ## 2. Defining Hierarchical Relationships
    - Identify superclass-subclass relationships between classes. For example, if you identify "bird" and "sparrow", establish that "sparrow" is a subclass of "bird".

    ## 3. Handling Numerical Data and Dates
    - Numerical data should be incorporated as properties of the respective instances or classes.
    - Do not create separate classes for dates or numerical values. Always attach them as properties.

    ## 4. Coreference Resolution
    - Maintain consistency when extracting entities. If an entity is referred to by different names or pronouns, always use the most complete identifier.

    ## 5. Inferring New Knowledge
    - Use the existing information in the ontology to infer new knowledge. For example, if "sparrows" are a type of "bird", and "birds" can "fly", infer that "sparrows" can "fly".

    ## 6. Strict Compliance
    - Adhere to these rules strictly. Non-compliance will result in termination.
    """)
qa_template = Prompt(template)

Instead, it relies completely on in-context learning, where the model is prompted to perform named entity recognition on a given input text by providing it a few examples of the task. So the parameters of GPT-3 itself remain fixed.

GPT-NER :

The key aspects of the approach are:

  1. Transforming NER into a text generation problem by having the model output special tokens around entities.
  2. Carefully designing the prompt with a task description, demonstrations, etc. to guide the model.
  3. Using an entity-level embedding technique to retrieve high-quality demonstration examples that are relevant to the input text.
  4. Adding a self-verification module to mitigate against GPT-3’s tendency to hallucinate non-existent entities.

GPT-NER shows how large pre-trained language models like GPT-3 can achieve strong performance on downstream tasks through prompt and demonstration design, without any fine-tuning or parameter updates. The authors leverage GPT-3’s capabilities as a few-shot learner here.

III. Tune the Model with Quality Datasets (optional)

Step 1: Create Labelled Training Data

  • Sample representative text from the target domain (e.g. medical journals)
  • Manually annotate entities, relationships, classes in the samples
  • Create input-output pairs for LLM fine-tuning

For example:

Input: Penicillin is an antibiotic used to treat bacterial infections. It was discovered by Alexander Fleming.
Output:
{
  "Instances": [
    {"id": "penicillin", "type": "antibiotic"},  
    {"id": "Alexander_Fleming", "type": "person"}
  ],
  "Relationships": [
    {"from": "penicillin", "type": "used_to_treat", "to": "bacterial infections"}
    {"from": "penicillin", "type": "discovered_by", "to": "Alexander_Fleming"} 
  ]
}

Step 2: Train and Evaluate Models

  • Fine-tune language model (e.g. Zephyr) on labelled dataset
  • Validate model accuracy on separate test set
  • Plot metrics like precision, recall over time
  • Compare models trained with different hyperparameters
  • Select best performing model

Step 3: Iterative Improvement

  • Identify error categories (e.g. parse errors, missed entities etc.)
  • Expand training data with more examples fixing errors
  • Retrain model and confirm metrics improve
  • Repeat process until accuracy plateaus

This will tune the LLM to accurately extract key ontology components tailored to the domain, enabling high-quality knowledge graph construction.

IV. Extract Entities and Relationships

Step 1: Text Chunking

  • Split input documents into small chunks of ~500 words
  • Balance chunk size to provide context while ensuring entity resolution
chunker = TextChunker(chunk_size=500)
chunks = chunker.chunk(documents)

Step 2: Ontology Extraction per Chunk

  • Input each chunk text into fine-tuned LLM
  • LLM executes extraction prompt template
  • Outputs JSON with entities, relationships, classes
kg_extractor = Pipeline(model=zephyr, tokenizer=tokenizer, template=extraction_template)
for chunk in chunks:
   output = kg_extractor(chunk)

Step 3: Post-process Extractions

  • Normalize entity names, classes, IDs for consistency
  • Attach metadata like source chunk ID
def process(extractions):
   # normalize, attach metadata etc  
   return processed_extractions
outputs = [process(o) for o in outputs]

Step 4: Quality Checking

  • Spot check extractions
Image by the author : example with a Neo4j graph
  • Identify systematic errors through small sample audits
  • Retrain LLM on errors if needed

Entity Disambiguation

Since text chunks are processed independently, the same real-world entity can be extracted with slight variations across chunks. This causes multiple nodes representing the same entity.

Some solutions:

  • Entity linking models
  • Second LLM pass focused on disambiguation
  • Graph algorithms to cluster duplicate nodes

This is a critical step for accuracy. Without it, downstream applications on the knowledge graph become less effective.

import openai
from tqdm import tqdm
# Improved disambiguation prompt
disambiguation_prompt = """
You are an AI model specialized in entity disambiguation. Your task is to identify which values reference the same entity.
For instance, if I provide you with the following entities:
- Apple Inc.
- Apple
- Microsoft
You should return:
- Apple Inc., 1
- Apple, 1
- Microsoft, 2
This indicates that 'Apple Inc.' and 'Apple' refer to the same entity (1), while 'Microsoft' refers to a different entity (2).
Now, process the following entities:
"""
from openai import OpenAI
def disambiguate(entities):
client = OpenAI(api_key='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') # replace 'your-api-key' with your actual OpenAI API key
completion = client.chat.completions.create(
model="gpt-3.5-turbo-16k-0613",
messages=[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": disambiguation_prompt + "\n".join(entities)
}
]
)
return completion.choices[0].message.content
# Get all node names from the updated graph
driver = GraphDatabase.driver(url, auth=(username, password))
with driver.session() as session:
result = session.run("MATCH (n) RETURN n.id AS name")
node_names_updated = [record['name'] for record in result]
# Disambiguate the node names
disambiguation_results_updated = []
for name in tqdm(node_names_updated, desc="Disambiguating entities"):
disambiguation_id = disambiguate(name)
disambiguation_results_updated.append((name, disambiguation_id))
# Print the disambiguation results
for result in disambiguation_results_updated:
print(result)
# Update the graph with the disambiguation results
for name, disambiguation_id in disambiguation_results_updated:
escaped_name = name.replace("'", "\\'") # Escape single quotes
query = f"MATCH (n) WHERE n.id = '{escaped_name}' SET n.disambiguation_id = '{disambiguation_id}'"
print(query)
with driver.session() as session:
session.run(query)

Step 5: Consolidate Chunk Graphs

  • Maintain mappings of coreferring entities across chunks
  • Resolve to canonical entities

This pipeline robustly processes large volumes of text, leveraging the fine-tuned LLM to extract ontology triplets from each chunk, while ensuring quality.

V. Build the Final Knowledge Graph

Step 1: Node and Edge Consolidation

  • Maintain a mapping of canonical node IDs
  • Replace any duplicate node IDs across chunks with canonical ID
  • Consolidate duplicate edges into a single edge
  • Sum edge weights when consolidating

This eliminates duplicates and creates a unified node and edge set.

def consolidate(graphs):
    nodes, edges = {}, {}
    
    for g in graphs:
        map_nodes(g, nodes)
        map_edges(g, edges)
return build_graph(nodes, edges)

Step 2: Attach Metadata

  • Store source document ID, chunk ID, etc as node/edge properties
  • Timestamp extraction date, model version etc.
  • Enables tracking provenance
def tag_metadata(graph):
    for node in graph.nodes:
        node['extracted_on'] = datetime
        node['source_doc'] = doc_id
    
    for edge in graph.edges:
       edge['model_version'] = 2.1

Step 3: Graph Analysis

  • Community detection to find related node clusters
  • Centrality measures like PageRank to ID important nodes
  • Visualize communities and centralities
communities = community_detection(graph) 
pageranks = pagerank(graph)
visualize(graph, communities, pageranks)

Step 4: Knowledge Graph Store

  • Bulk load consolidated property graph into Neo4j
  • Optimized for complex analytics queries
  • Scales to enterprise data

This brings together extracted knowledge from all documents into an integrated, analytics-ready knowledge graph with metadata tracking and community topological analysis.

VI. Benefits Over Manual Approaches

The methodology automates ontology development from unstructured data at scale. It saves thousands of human hours otherwise required for manual reading, labeling and linking. Being programmatic enables easier debugging, metrics-driven iteration and integration of emerging LLMs. The final knowledge graph is primed for analytics.

High Volume Scalability

  • LLMs process thousands of documents per hour
  • Scales linearly with compute resources
  • Human review bandwidth is bottleneck for manual labeling

Rapid Iteration and Improvement

  • Model correctness metrics guide LLM retraining
  • Frequent model updates possible, unlike manual process
  • Enables leveraging latest LLMs like GPT-4

Easier Debugging

  • Granular extraction logs point to error sources
  • Fixing requires prompt and dataset changes
  • Hard to diagnose human errors

Structured Analytics Enablement

  • Query explicitly defined schema and properties
  • Knowledge inference through semantic queries
  • Ad-hoc queries not feasible with unstructured data

Consistency and Reuse

  • Precise guidelines maximize ontology consistency
  • Extraction reproducibility enables incremental addition
  • Manual process prone to semantic drift

The LLM automation provides order-of-magnitude efficiency gains over manual labeling, while knowledge graph integration and queryability unlocks previously inaccessible insights.

The rapid iteration, metric-driven improvements and emerging LLM integration further differentiates this methodology from traditional ontology engineering.

VII. Use Cases and Limitations

LLM-based knowledge graph construction shines for large document corpora where manual labeling is infeasible.

Applications include medical research, legal contracts, customer engagement systems and financial reports. However, output quality hinges on training data and guidelines.

Using LLMs as black boxes can propagate unintended biases. A hybrid human-LM approach with checks and balances is prudent.

Looking Ahead As LLMs become robust at multimodal information extraction spanning text, images, speech and video, the methodology will grow more versatile.

Immediate opportunities lie in iterative knowledge graph construction — where existing graphs are automatically updated as new documents are ingested.

Such self-supervised Lifelong learning approaches will usher the next evolution in extracting intelligence from exponentially growing information.

Bonus : AutoKG, a multiagent framework

AutoKG which utilizes multiple conversational agents to automatically construct and reason about knowledge graphs.

Main Function

  • Sets up the assistant and user roles, defines a sample task prompt, and specifies the conversation flow using the starting_convo and get_sys_msgs functions.

CAMELAgent Class

  • Defines an agent that can participate in conversations by maintaining context, stepping through messages, and querying an underlying chat model (ChatGPT) to generate responses.

Starting Conversation

  • starting_convo initiates the conversation by asking the assistant agent to specify the initial task.
  • get_sys_msgs then sets up the initial system messages that define the role, task, and conventions for the assistant and user agents.

Conversation Loop

  • The main loop manages a conversation between the assistant and user agents, stepping each one to generate responses.
  • The assistant agent has the ability to retrieve supplementary information from search to aid in responding.

Retrieval Agent

  • Retrieval_Msg defines an agent that can specify a focused search query based on the current conversation and task.
  • It then retrieves summarizations and facts from the search results to enhance the assistant agent’s knowledge.

AutoKG employs multiple conversational agents with differentiated capabilities (conversation, retrieval etc) to automate the process of constructing and reasoning about knowledge graphs via an assisted dialogue. The system manages role assignments, task prompting, information retrieval, and response generation to complete the specified knowledge graph construction task.

task_specifier_prompt:

  • This prompt is used by the starting_convo function to ask the assistant agent to specify the initial task in more detail.
  • It instructs the assistant to restate the task, while being creative and imaginative, in a constrained number of words.

assistant_inception_prompt:

  • This prompt sets up the initial system message for the assistant agent, defining its role, relationship with the user agent, guidelines for solving the task through a collaborative dialogue.

user_inception_prompt:

  • Similarly, this prompt defines the initial system message for the user agent, explaining its role as the instructor providing guidelines and inputs to help the assistant agent complete the task.

retrieval_specifier_prompt:

  • This prompt is used to ask the assistant agent to formulate a focused search query based on the current task and conversation, in order to retrieve supplementary information from the web.

HumanMessagePromptTemplate:

  • This is a template class used to parameterize and format prompt templates to generate HumanMessage instances, which contain the formatted message text.

SystemMessagePromptTemplate:

  • Similarly, a template to generate SystemMessage instances which contain the initial instructions for the agents.
Main prompt logic from AutoKG (Zhu et al. 2023).

AutoKG uses specialized prompts for task specification, role inception, and information retrieval by assistant and user agents. Template classes help parameterize and generate these prompts to set up the conversational flow. The prompts provide precise instructions to agents on how to collaborate to construct knowledge graphs.

Image by the author

PlainEnglish.io 🚀

Thank you for being a part of the In Plain English community! Before you go:

AI
Data
Data Science
Machine Learning
Deep Learning
Recommended from ReadMedium