Giuseppe Futia


LLMs and KGs in practice

Building KGs with LLMs — The Use Case of the Rockefeller Archive Center

How we used GPT to perform Named Entity Recognition, Relation Extraction, and Entity Resolution on typewritten documents from the beginning of the 20th century

The original article, co-authored with Alessia Melania Lonoce, is available on the GraphAware website

Image generated with Midjourney

In the first half of the 20th century, several scientists made enormous contributions to science and technology in different fields. Among them are several Nobel Prize winners, including Albert Einstein, Ernest Lawrence, and Niels Bohr, who made fundamental contributions to physics studies.

But what about their role as fundraisers? And what about their influence in funding research topics unrelated to their scientific contribution, such as life sciences and biology?

Institutions such as the Rockefeller Archive Center (RAC) collect information about science funding, including its long tail. In particular, RAC is a major repository and research center for studying the Rockefeller Foundation’s (RF) philanthropic activities and their impact worldwide. Among the documents used in this project are the Officer Diaries (ODs) dating back to the 1930s, which report on conversations between the RF Officers (the Foundation’s representatives responsible for identifying research projects for funding) and scientists working on specific research topics in the natural sciences domain. This dataset was chosen because the 1930s are the exciting era when modern research topics, such as molecular biology and nuclear science, were being formed and defined.

Such diaries play an essential role because the reported conversations and interactions between people may lead to the granting of a specific research project. However, understanding the dynamics driving the RF granting process is complex and requires producing well-structured data shaped as a Knowledge Graph (KG). Consider the case of Niels Bohr: he made foundational contributions to understanding atomic structure, but, as reported by Wikipedia, he was also a promoter of scientific research. The KG created on the ODs revealed how this supporting activity was concretely realised.

This blog post describes how we adopted the OpenAI GPT models in a production-ready and scalable environment to produce a high-quality KG on top of the ODs textual content. This environment is shipped with Hume, GraphAware’s end-to-end graph analytics platform. The lessons learned in this article can easily be applied and extended to different application domains, from science to law enforcement, by combining the powerful features of Hume and GenAI.

Let’s dive into the details

To move from Officer Diaries (ODs) to a KG, we exploited GPT and Hume’s capabilities in several situations to build a meaningful, useful, and high-quality KG.

Our journey will start with an overview of the entire information extraction pipeline and then continue to discuss the details.

The information extraction pipeline built for this use case involves several steps, each crucial to extracting relevant information accurately and efficiently.

Figure 1 — Information extraction pipeline, including multiple steps such as OCR, Document Chunking, Named Entity Recognition, Relation Extraction, Normalization and Entity Resolution
  • Document chunking: an ad-hoc chunking strategy for extracting diary entries;
  • Named Entity Recognition (NER): identifies and classifies named entities mentioned in unstructured texts into predefined categories;
  • Relation Extraction (RE): detects and classifies semantic relationships among pairs of entities;
  • Entity Resolution (ER): creates a connection between different data records referring to the same entity.
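The steps above can be sketched as a small pipeline skeleton. This is our own illustration: the function bodies are trivial placeholders standing in for the GPT-backed components, and none of the names reflect Hume’s actual API.

```python
# Hypothetical pipeline skeleton for the four steps above.
# In the real system each step is backed by a GPT call orchestrated by Hume.

def chunk_document(pages):
    """Document chunking placeholder: naively one chunk per page, with a date."""
    return [{"date": f"1935-01-{i + 1:02d}", "text": p} for i, p in enumerate(pages)]

def extract_entities(chunk):
    """NER placeholder: pick capitalised words as entity mentions."""
    return [w.strip(".,") for w in chunk["text"].split() if w.istitle()]

def extract_relations(chunk, entities):
    """RE placeholder: pair consecutive entity mentions."""
    return list(zip(entities, entities[1:]))

def resolve_entities(all_entities):
    """ER placeholder: collapse case-insensitive duplicates."""
    return sorted({e.lower() for e in all_entities})

def run_pipeline(pages):
    chunks = chunk_document(pages)
    entities, relations = [], []
    for c in chunks:
        ents = extract_entities(c)
        entities.extend(ents)
        relations.extend(extract_relations(c, ents))
    return {"entities": resolve_entities(entities), "relations": relations}

result = run_pipeline(["Bohr met Franck today.", "Discussed physics with Bohr."])
print(result["entities"])  # ['bohr', 'discussed', 'franck']
```

The point of the skeleton is the data flow: chunking happens once per document, NER and RE run per chunk, and ER runs once over the accumulated mentions.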

Document chunking: entry extraction from Officer Diaries

Processing a large document usually cannot be performed in one shot; it requires a chunking strategy. GPT comes with a token limitation that, despite rapidly increasing with the release of new versions, does not allow us to process the whole dataset at once. Besides, in our experience, the longer the text to be analysed, the poorer the quality of the extracted entities and relations.

In our case, the documents are Officer Diaries (ODs), so the natural way to divide them is to consider the dates when each diary entry was written. This step is fundamental for our use case because improper document segmentation risks losing valuable context, which is necessary for high-quality NER and RE.

For the document chunking, we perform a Date Extraction step that extracts the dates within the pages corresponding to diary entries.

Figure 2 — Example of multiple Officer Diary entries detected in the document

For the Date Extraction, we provided GPT with an input list of enumerated lines corresponding to the page rows. The prompt instructs the model to detect, for each page row, a date that corresponds to a diary entry. The output format is a JSON list, where each element contains the lines on which the date appears and the normalised date or date range.

The Date Extraction process includes several steps for robustness.

Figure 3 — Date Extraction pipeline combining different models, including GPT-3.5-turbo and GPT-4

The first extraction attempt was made using the GPT-3.5-turbo model. If a failure occurs, a second attempt will be performed using the GPT-4 model. The last step is to check if the result includes only diary entry dates and not generic dates mentioned in the text. By leveraging the extracted dates, we reconstruct the text of the Officer Diary Entries (ODEs).
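The escalate-and-validate pattern just described can be sketched as follows. `call_model` is a stand-in for the real GPT call (the actual prompt and response schema differ), and the regex check mirrors the final validation step that discards generic dates mentioned in the text.

```python
import json
import re

def extract_dates(call_model, page_lines):
    """Try a cheaper model first, escalate to a stronger one on failure,
    then keep only results that look like normalised diary-entry dates.

    `call_model(model, lines)` stands in for the real GPT call and is
    expected to return a JSON string: a list of {"lines": [...], "date": ...}.
    """
    for model in ("gpt-3.5-turbo", "gpt-4"):
        try:
            result = json.loads(call_model(model, page_lines))
            break
        except (ValueError, RuntimeError):
            continue  # service error or malformed JSON: escalate to next model
    else:
        return []  # both attempts failed
    # Final validation: keep only normalised ISO-like dates.
    return [r for r in result if re.fullmatch(r"\d{4}-\d{2}-\d{2}", r.get("date", ""))]

# Simulated model: the cheap one fails, the stronger one answers,
# and its answer contains one generic date to be filtered out.
def fake_call(model, lines):
    if model == "gpt-3.5-turbo":
        raise RuntimeError("service unavailable")
    return json.dumps([{"lines": [1], "date": "1935-12-13"},
                       {"lines": [7], "date": "next Tuesday"}])

dates = extract_dates(fake_call, ["DIARY OF WW", "December 13, 1935 ..."])
print(dates)  # [{'lines': [1], 'date': '1935-12-13'}]
```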

NER and RE on Officer Diary Entries

Executing NER and RE with GPT in a production environment requires considering multiple aspects:

  • The documents’ length and the related impact on the accuracy of results.
  • The flexibility in using various prompts to manage different cases and scenarios.
  • The robustness against the failures of GPT services and the results generated in a non-expected format.

To satisfy these requirements, we can directly leverage the features of the Orchestra system integrated into the Hume platform.

Processing ODEs based on the length

As discussed in the previous section, extracting fine-grained information using GPT requires splitting the full ODs into individual diary entries (ODEs). However, in some cases the input text is still too long for the task at hand because some detected ODEs are spread over multiple pages.

Figure 4 — Example of Hume Orchestra Workflow to process different types of documents with GPT by defining multiple branches

The figure above shows the Orchestra workflow built to process the text of the ODEs. More precisely, to deal with the situation mentioned earlier, we defined two branches: a straightforward one to process short diary entries and a more complex one, in which the text of the long diary entries is first split into blocks and then processed. This splitting step is adopted to manage the token limitation of the GPT model used (GPT-4) and, more importantly, to limit the degradation of the response quality related to long inputs. We noticed that this aspect must be considered in this type of task and, in general, when extracting facts from text using a generative model.

The components used to create the prompt are Prompt and parameters — Single page and Prompt and parameters — Multiple pages. Both prompts perform the NER and RE tasks; the key difference between them is how they compose the prompt. Indeed, for ODEs split into blocks, we need to add context (from the previous block to the one being processed) to take advantage of GPT’s implicit co-reference capabilities and to handle the case where a sentence is split across two blocks.
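The two-branch splitting logic can be sketched like this, using word counts as a rough stand-in for real token counts; the overlap carries the tail of the previous block forward as context for co-reference resolution and boundary-split sentences.

```python
def split_with_context(text, max_words=100, context_words=20):
    """Split a long diary entry into blocks, prefixing each block (after
    the first) with the tail of the previous one, so the model can resolve
    co-references and sentences cut at the block boundary.
    Word counts stand in for real token counts in this sketch."""
    words = text.split()
    if len(words) <= max_words:
        return [text]  # short-entry branch: process in one shot
    blocks, start = [], 0
    while start < len(words):
        ctx = words[max(0, start - context_words):start]   # overlap from previous block
        body = words[start:start + max_words]
        blocks.append(" ".join(ctx + body))
        start += max_words
    return blocks

short = split_with_context("a short diary entry", max_words=10)
long_blocks = split_with_context(" ".join(f"w{i}" for i in range(25)),
                                 max_words=10, context_words=3)
print(len(short), len(long_blocks))  # 1 3
```

Each block after the first starts with a few words of repeated context, which is the part the multi-page prompt treats differently from the single-page one.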

Process robustness

One of the key aspects of using GPT in a production-ready environment is failure management. We can identify two types of failures:

  • GPT service failure: this could be related to multiple reasons, from malformed input to unavailability.
  • GPT response parser failure: the result is not in the expected format. Beyond the specificity of the prompt, this could happen due to the generative nature of the model.

To manage these cases, and potentially others, Orchestra workflows enable you to define branches to handle the failures of single or multiple components involved in the main process.

Figure 5 — Example of Hume Orchestra Workflow to manage different types of failures

Each branch allows us to perform specific operations in case of component failures. In this scenario, we assign a distinctive label to the diary entries that generate the error to enable targeted investigation and, if needed, refine the process accordingly.
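The failure branches can be sketched as follows; the two exception types correspond to the two failure categories above, and the label names are our own illustration of the distinctive labels assigned to failing entries.

```python
def process_entries(entries, extract):
    """Run `extract` over each entry; on failure, tag the entry with a
    distinctive label instead of aborting the whole batch, mirroring the
    error branches of the Orchestra workflow (label names are illustrative)."""
    succeeded, failed = [], []
    for entry in entries:
        try:
            succeeded.append({"entry": entry, "result": extract(entry)})
        except RuntimeError:   # GPT service failure
            failed.append({"entry": entry, "label": "ExtractionServiceError"})
        except ValueError:     # response not in the expected format
            failed.append({"entry": entry, "label": "ExtractionParseError"})
    return succeeded, failed

def flaky_extract(entry):
    """Stand-in extractor that fails on unparseable input."""
    if "garbled" in entry:
        raise ValueError("unparseable GPT response")
    return {"entities": entry.split()[:1]}

ok, bad = process_entries(["Bohr visited.", "garbled ~~ text"], flaky_extract)
print(len(ok), bad[0]["label"])  # 1 ExtractionParseError
```

Labelled failures can then be queried later for targeted investigation and prompt refinement, rather than silently dropping the entries.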

NER and RE results at a glance

To build an accurate Knowledge Graph from unstructured data, we need a high-quality knowledge extraction system that covers key building blocks such as NER and RE. To understand the effectiveness of this process, consider the following figure, which reports different pieces of text from a single Officer Diary Entry (ODE):

Figure 6 — Example of Officer Diary Entry (ODE) split among multiple pages

In this case, Prof. O. Stutzer is explicitly mentioned once on a single page of an ODE. Then, on the same page, he is mentioned as S. and again as Stutzer on the following page. Moreover, each of these mentions is related to mentions of different entities such as organisations, occupations, and other people.

We aim to produce a single node in the KG corresponding to a unique entity representing all these different mentions of O. Stutzer and build the related relationships accordingly. To reach this goal, we instructed the model to recognise these situations. Moreover, we defined a prompt for leveraging the contextual information from the previous page to reconstruct the full names.

The following graph represents the result of performing advanced NER and RE steps on the previous text:

Figure 7 — Graph structure achieved by performing NER and RE on the text reported in Figure 6

As you can see from this example, strings such as O. Stutzer and K. Jurasky have been recognized as person entities (purple nodes), while strings such as paleo-botanist and geologist have been recognized as occupation entities (red nodes). Moreover, in the case of person entities, the model has been able to reconcile different namings and perform the RE in an effective way, by connecting O. Stutzer to the related entities.

Starting from NER and RE tasks, we are able to extract more specific information relevant to the application domain, such as sentiment analysis and conversation extraction.

Sentiment analysis

When reading through the diaries, we noticed how often people shared their personal sentiments about others. This could be an essential part of a KG-based analysis of grant assignment patterns, allowing us, for example, to identify true influencers. By adjusting the GPT prompt related to TALKED_ABOUT relations, we can easily identify and store these complex impressions. Consider the following excerpt:

Figure 8 — Example of Officer Diary Entry (ODE) including opinions expressed by some scientists such as James Franck about other scientists

The text clearly conveys J. Franck’s opinion of some colleagues. First, he talked about Stark and Schumann, saying that Schumann outclassed Stark; then he cited Hertz and Debye as first-class physicists in Germany. All this information has been correctly extracted, and the figure below contains the resulting graph.

Figure 9 — Graph structure achieved by performing NER, RE, and sentiment analysis on the text reported in Figure 8. Compared to the textual content, the generated nodes related to people are labeled with their full names as a result of the Entity Resolution (ER) process

Conversation extraction

The diary entries are rich in detailed descriptions of conversations between Rockefeller Foundation officers and scientists about their research projects. Consider the following piece of text:

Figure 10 — Example of Officer Diary Entry (ODE) including details about a conversation that happened during a lunch

Our goal is to identify these discussions in the text and convert them into Conversation nodes with many details, including the participants, the discussed topics, and the context, such as an in-person meeting or a lunch. We added a requirement to identify had conversation with relation types in the prompt:

[...] "had conversation with" (i.e., an Officer talked with one or multiple persons present at the time) [...]

Example:
{
    "entities": {
        "person": [
            {"id": 1,"name": "William Meade","titles": []},
            {"id": 2,"name": "WW","titles": []}
        ]
    }
}
{
    "relationships": {
        "had conversation with":[{"source": 2,"targets": [1],"context": "on-site"}]
    }
}

The result modeled as a graph is the following:

Figure 11 — Graph structure achieved by performing NER, RE, and conversation extraction on the text reported in Figure 10.

The model correctly identified a lunch conversation on December 13, 1935 between Warren Weaver (WW) and Prof. John J. Coss et al. about the function of the modern museum.
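A sketch of how output in the JSON shape shown above can be turned into graph nodes and edges (the node and edge shapes here are our own illustration, not Hume’s schema):

```python
import json

# Entity and relationship payloads in the shape of the prompt example above.
ner_output = json.loads("""{
    "entities": {"person": [
        {"id": 1, "name": "William Meade", "titles": []},
        {"id": 2, "name": "WW", "titles": []}
    ]}
}""")
re_output = json.loads("""{
    "relationships": {
        "had conversation with": [{"source": 2, "targets": [1], "context": "on-site"}]
    }
}""")

# Nodes keyed by the model-assigned ids; edges as (source, target, type, context).
nodes = {p["id"]: {"label": "Person", "name": p["name"]}
         for p in ner_output["entities"]["person"]}
edges = [(rel["source"], t, rtype, rel.get("context"))
         for rtype, rels in re_output["relationships"].items()
         for rel in rels
         for t in rel["targets"]]

print(edges)  # [(2, 1, 'had conversation with', 'on-site')]
```

The local ids (1, 2) only disambiguate mentions within one entry; turning them into stable graph nodes is exactly what the Entity Resolution step described later is for.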

Let’s wrap up

The following figures show the graph result we are able to achieve by combining all the previous ingredients:

Figure 12 — Example of Officer Diary Entry (ODE) including details about all the information to extract
Figure 13 — Graph structure achieved by performing NER, RE, sentiment analysis, and conversation extraction on the text reported in Figure 12

This example shows how we can powerfully represent a conversation between two RF Officers, Warren Weaver and W. E. Tisdale, and two scientists, A. Lacassagne and F. Joliot. The latter also mentions the research activities of W. Bothe and P. A. M. Dirac. By reading the text, you will notice that these two persons are mentioned only by their surnames, and no previous textual context contains their full names. However, one of the key aspects of building a high-quality KG is to identify the same person or organisation across their multiple expressions in different pages or documents, resolving ambiguous entities and representing them with unique nodes.

Graph-based Entity Resolution with GPT

Entity Resolution (ER) is a widely applicable approach to resolving data into unique and valuable entity profiles. This crucial process allows organisations to make key decisions based on complete and/or unscattered data.

In the case of unstructured content, the objective is to reconcile ambiguous entity mentions across paragraphs, pages, and even other documents into final resolved entity nodes.

For example, consider the different pieces of text, shown in the figure below, coming from different ODEs of Frank B. Hanson (FBH) that mention “Curtis”.

Figure 14 — Example of ODEs mentioning scientists named “Curtis”

In this case, page 121 of the FBH ODE from 1934 mentions the surname Curtis. Another ODE from 1933 mentions the full name W. C. Curtis. And the last ODE from 1938 mentions Howard J. Curtis.

Based on different sources, could we leverage the relations generated by GPT on these different person mentions to clearly state that Curtis corresponds to W. C. Curtis or Howard J. Curtis, or neither?

Person ER through Organisations and Topics Normalisation

To perform ER on person entities, we cannot consider only the similarity of names; we also need to include other signals to increase fidelity. By performing complex Relation Extraction (RE) tasks, we have associated most of these names with the organisation they work for and the scientific field or topic they work on.

Although information about organisations and research topics is particularly useful, such information could be expressed in different ways. For instance, Oxford University could be interchangeably expressed as Oxford or University of Oxford. More advanced scenarios allow us to understand that Christ Church, Oxford is somehow connected to the academic institution.

Regarding the topics, we can express the same research field using different words. For example, radiation problems related to living organisms and effects of radiation upon living organisms express the same concept, but the adopted strings use different terminologies.

To address these specific issues and to aggregate (and potentially reconcile) the different expressions for organisations and topics, we adopted two different approaches, one based on the definition of a proper prompt and the other that leverages the embeddings computed by GPT.

In the case of the organisations, we specifically asked GPT to normalise the names and understand the hierarchical connections between the organisations. To reach this goal, we defined an Orchestra workflow in which the organisation names extracted by GPT during the NER/RE phase are passed back to GPT for the normalisation task. The defined prompt includes the following key information:

[...] normalize the organization name, identify the main institution of the organization (if exists) [...]

{
  "name": "harvard school of medicine",
  "name normalized": "Harvard Medical School",
  "main institution": "University of Harvard"
}

The resulting graph is the following:

Figure 15 — Normalization performed using GPT and applied to organisations

The yellow nodes are the entities extracted during the NER/RE phase, while the blue nodes define the resolved entities. This operation is beneficial for noise reduction, because the analyst will analyse only two nodes and their relationships, instead of four disconnected nodes.
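The merging effect can be sketched as follows, with hypothetical normalisation records in the shape of the GPT response above: four raw mentions collapse into two resolved nodes.

```python
# Hypothetical normalisation output, in the shape of the GPT response above.
# The records are illustrative, not actual model output.
normalised = [
    {"name": "harvard school of medicine", "name normalized": "Harvard Medical School"},
    {"name": "harvard medical school",     "name normalized": "Harvard Medical School"},
    {"name": "oxford",                     "name normalized": "University of Oxford"},
    {"name": "university of oxford",       "name normalized": "University of Oxford"},
]

# Fold raw mentions (yellow nodes) into resolved entities (blue nodes).
resolved = {}
for rec in normalised:
    resolved.setdefault(rec["name normalized"], []).append(rec["name"])

for canonical, mentions in sorted(resolved.items()):
    print(canonical, "<-", mentions)
```

In the graph, each canonical key becomes one resolved node, with the raw mentions kept as linked extraction nodes for provenance.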

Regarding the topics, our purpose is to group them by semantic similarity. To achieve this, we jointly exploit embedding and clustering operations. The former projects topics into a vector space, where the closer the embeddings are to each other, the more semantically similar the topics are. More precisely, the embeddings are computed with the text-embedding-ada-002 GPT model. The latter uses this closeness to group topics at the desired semantic granularity using hierarchical clustering. The following figure shows the cluster (group) related to hyperglycemia. The topics are grouped into TopicGroup nodes, which are themselves grouped hierarchically. This enables selecting the most suitable semantic grain for the use case, as in the example where a coarser grain merges the cluster related to hyperglycemia with the one related to diabetes. Both embedding and clustering operations were performed using Orchestra.

Figure 16 — Topics aggregation by clustering embeddings generated with GPT
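The embedding-plus-clustering idea can be sketched with toy 3-d vectors standing in for the 1536-dimensional ada-002 embeddings. The greedy grouping below is a simplified stand-in for the hierarchical clustering used in the article, with the threshold playing the role of the semantic grain.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    return num / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy vectors standing in for text-embedding-ada-002 embeddings.
topic_vectors = {
    "hyperglycemia":                       [0.9, 0.1, 0.0],
    "elevated blood sugar":                [0.85, 0.15, 0.05],
    "diabetes":                            [0.7, 0.3, 0.1],
    "effects of radiation upon organisms": [0.0, 0.2, 0.9],
}

def greedy_cluster(vectors, threshold):
    """Group topics whose embedding is close enough to a cluster seed.
    Raising the threshold gives a finer semantic grain, lowering it a
    coarser one (a stand-in for cutting the hierarchical-cluster tree)."""
    clusters = []  # list of (seed_vector, [topic names])
    for name, vec in vectors.items():
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]

print(greedy_cluster(topic_vectors, threshold=0.98))  # fine grain: 3 groups
print(greedy_cluster(topic_vectors, threshold=0.90))  # coarse grain: diabetes merges in
```

At the fine grain, hyperglycemia and elevated blood sugar cluster together while diabetes stays separate; at the coarse grain the two groups merge, mirroring the TopicGroup hierarchy in Figure 16.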

Let’s see the impact of these normalisation steps on persons’ resolution. The graph below shows why the Chambers and Robert Chambers nodes must be resolved as the same person. The first one explicitly works for New York University. The second one works for Washington Square College. However, GPT recognised that the University and the College are connected. As reported on the New York University page on Wikipedia:

“In 1914, Washington Square College was founded as the downtown undergraduate college of NYU”.

Based on this non-trivial connection, we can state that Chambers and Robert Chambers very likely represent the same real-world person, a conclusion that would not be reliable based purely on the shared surname.

Figure 17 — ER applied to person nodes by leveraging their relationships with organisations

Now, let’s try to answer the question we posed at the beginning of this section dedicated to ER:

Could we leverage the relations generated by GPT on these different person mentions to clearly state that Curtis corresponds to W. C. Curtis or Howard J. Curtis, or neither?

The graph available below answers the question. As you can see, Curtis works on radiation problems related to living organisms, while W. C. Curtis works on the study of the effects of radiation upon living organisms. These two strings represent the same research field, but a trivial string-similarity algorithm would struggle to detect this. On the contrary, our approach to managing semantic similarity allows us to correctly resolve Curtis as W. C. Curtis and not as Howard J. Curtis.

Figure 18 — ER applied to “Curtis” by leveraging similarity between connected topics

KG-based Insights Discovery

The KG constructed using GPT combined with the Hume capabilities enables us to investigate the network-based dynamics that can reveal, for example, the influence of scientists such as Niels Bohr in promoting research activity. This analysis can start with the influence network that shows the most influential scientists by leveraging multiple relationship types, including TALKED_WITH, TALKED_ABOUT, STUDENT_OF, and WORKS_WITH, and graph algorithms revealing patterns and insights.

The figure below gives you an idea of the complexity of this network (it represents only a sub-graph of the whole influence network), in which the size of the hub nodes depends on the score computed by the PageRank algorithm. In this context, PageRank measures the importance of a person within the network based on the number of incoming connections and the significance of the corresponding sources of the relationships. This perspective is beneficial for our purpose because, if many relevant persons actively interact with a hub, this hub can influence decisions. One of the most important communities in this network shows that Niels Bohr is a central node (highlighted in the figure below).

Figure 19 — Influence network showing interactions between scientists

If you take a look at Niels Bohr’s Wikipedia page, several scientists from this community (see the image below), such as Otto Stern, Harald Bohr, Georg von Hevesy, and James Franck, are mentioned there and belong to Niels Bohr’s scientific field. On the one hand, this confirms that the network dynamics that emerged from the ODs correspond to well-known knowledge about these scientists, in particular Niels Bohr. On the other hand, we are interested in discovering long-tail information related to the influence of Niels Bohr in enabling the funding of different scientific fields.

Figure 20 — Influence network of Niels Bohr

To reach this goal, we can continue our investigation by exploring the grants awarded by the Rockefeller Foundation that mention Niels Bohr, focusing on research topics that are not directly connected with his research activity. Consider the following example showing a grant depicted in green.

Figure 21 — Subgraph showing a grant mentioning Niels Bohr and H. V. Brondsted

By exploring the connections to the grant, we discovered that it was assigned to H. V. Brondsted and awarded for biology topics, such as experimental zoology. Moreover, the grant details mention Niels Bohr. Starting from this grant information, the information extracted from the ODEs, and the KG built on top of it, we can try to answer the following question:

Could we reconstruct the interactions between H. V. Brondsted, Niels Bohr, and other related key persons that may have led to the awarding of the grant?

From the graph above, we noticed a reciprocal TALKED_ABOUT relationship between H. V. Brondsted and Niels Bohr. Moreover, we saw that August Krogh, one of Bohr's main collaborators based on his personal network, also has a TALKED_ABOUT relationship with Brondsted that provides a positive judgment about the researcher.

By leveraging this portion of the graph and running an explanation action (a Cypher query defined on the Hume platform) on the TALKED_ABOUT relationship, we can easily explore the sources that describe how these three persons interacted with each other.

An initial interaction was reported in 1937 by the HMM Officer:

Brondsted said that he has established connections with N Bohr and will investigate the effects of X-radiation on protoplasm of sponges at the Finsen Institute.

Another ODE from 1939 further clarifies the roles of Niels Bohr and August Krogh in the grant assignment.

Figure 22 — Example of ODEs showing interactions between Krogh, Bohr, and Brondsted

As you have seen from this example, by combining the GPT output, KG technology, and Hume features, we can easily reconstruct the interactions between the researchers. This particular example shows how we can directly answer our initial question about Niels Bohr's influence in supporting topics outside his main research activity.

Lessons learned

This blog post shows that GPT and LLMs are opening incredible opportunities for constructing high-quality and large-scale KGs from textual data, addressing complex tasks such as Named Entity Recognition, Relation Extraction, and Entity Resolution in real scenarios. As this article shows, we adopted multiple models during our journey: combining various models can help address less complex tasks more efficiently, reserving more advanced and expensive models for edge cases or more complicated situations. The example related to Niels Bohr suggests the enormous potential of combining LLMs with KGs for information discovery in natural language content. Moreover, thanks to the flexibility of these technologies, the techniques adopted to investigate these cases can be transferred and generalised to other domains, such as journalism and law enforcement. To further show the power of combining LLMs and KGs for more advanced insights and discoveries, a comprehensive analysis of the RAC KG will be the topic of one of the next blog posts.
