
LANGCHAIN — How to Improve Documentation
Computer science is no more about computers than astronomy is about telescopes. — Edsger W. Dijkstra
Improving documentation is a critical part of any software project. Clear, detailed documentation helps developers understand what the code does and speeds up onboarding for new team members. In this article, we explore how large language models (LLMs) can improve documentation by summarizing user questions and identifying documentation gaps, and we walk through code snippets that illustrate each approach.
Using Large Language Models (LLMs) for Documentation Improvement
The goal is to use LLMs to summarize a large dataset of user questions collected with Mendable, an AI-enabled chat application, and to identify the documentation gaps those questions reveal. We tried two approaches:
Approach 1: Map-Reduce
This approach splits the questions into groups that fit within the context window of either GPT-3.5-16k or Claude-2, summarizes each group, and then consolidates the group summaries into a final synthesis.
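The grouping and consolidation steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in the experiments: `estimate_tokens`, `split_questions`, and `map_reduce_summarize` are hypothetical helper names, the four-characters-per-token figure is only a rough heuristic, and `summarize` stands in for whatever LLM call you use.

```python
from typing import Callable, List

# Assumption: ~4 characters per token is a rough heuristic, not an exact
# tokenizer count; swap in a real tokenizer for anything cost-sensitive.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def split_questions(questions: List[str], max_tokens: int) -> List[List[str]]:
    """Greedily pack questions into groups that stay within max_tokens.

    A single question larger than max_tokens still gets its own group.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for question in questions:
        cost = estimate_tokens(question)
        # Start a new group when this question would exceed the budget.
        if current and used + cost > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(question)
        used += cost
    if current:
        groups.append(current)
    return groups


def map_reduce_summarize(
    questions: List[str],
    summarize: Callable[[str], str],
    max_tokens: int,
) -> str:
    """Map: summarize each group. Reduce: summarize the group summaries.

    Assumes the combined group summaries fit in one context window.
    """
    groups = split_questions(questions, max_tokens)
    group_summaries = [summarize("\n".join(group)) for group in groups]
    return summarize("\n".join(group_summaries))
```

Passing the summarizer in as a function keeps the sketch independent of any particular model or client library, so the same loop can be tried with GPT-3.5-16k, Claude-2, or a stub during testing.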
# Map-Reduce approach (sketch)
# split_questions, summarize, and consolidate_summaries are placeholder
# helpers you would implement yourself; they are not LangChain APIs.
# Map: split questions into groups that fit the model's context window
question_groups = split_questions(questions, context_window='GPT-3.5-16k')
# Summarize each group
group_summaries = []
for group in question_groups:
    summary = summarize(group)
    group_summaries.append(summary)
# Reduce: consolidate the group summaries into a final synthesis
final_synthesis = consolidate_summaries(group_summaries)
Approach 2: k-Means Clustering and Summarization
This approach embeds the questions, clusters the embeddings with k-means, and then uses GPT-4 to summarize each cluster.
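To make the clustering step concrete, here is a dependency-free sketch of the basic k-means loop (Lloyd's algorithm) followed by grouping questions by cluster label. It is illustrative only: the function names `kmeans` and `group_by_label` are hypothetical, centroids are seeded from the first k points purely for determinism, and the toy 2-D vectors stand in for real embeddings. In practice a library implementation such as scikit-learn's `KMeans` would normally be used.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def kmeans(points: List[Vector], k: int, iterations: int = 20) -> List[int]:
    """Return a cluster label for each point (plain Lloyd's algorithm)."""
    # Assumption: seed centroids with the first k points for determinism;
    # library implementations use smarter schemes such as k-means++.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels


def group_by_label(items: List[str], labels: List[int]) -> Dict[int, List[str]]:
    """Collect items that share a cluster label, ready for summarization."""
    groups: Dict[int, List[str]] = {}
    for item, label in zip(items, labels):
        groups.setdefault(label, []).append(item)
    return groups


# Toy data (made up for illustration): four questions and 2-D "embeddings".
questions = ["load a CSV", "use agents", "load a PDF", "agent tools"]
embeddings = [(0.0, 0.0), (10.0, 10.0), (0.1, 0.2), (9.8, 10.1)]
clusters = group_by_label(questions, kmeans(embeddings, k=2))
```

Each value in `clusters` is a list of related questions that can be sent to the summarization model as a single prompt.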
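# k-Means clustering and summarization (sketch)
# summarize_with_gpt4 is a placeholder for a GPT-4 call, not a LangChain API.
# questions and embedded_questions are assumed to be aligned, one embedding
# per question.
from sklearn.cluster import KMeans
# Cluster embedded questions
kmeans = KMeans(n_clusters=10).fit(embedded_questions)
# Group the original questions by their assigned cluster label
clusters = {}
for question, label in zip(questions, kmeans.labels_):
    clusters.setdefault(label, []).append(question)
# Summarize each cluster using GPT-4
cluster_summaries = []
for cluster in clusters.values():
    summary = summarize_with_gpt4(cluster)
    cluster_summaries.append(summary)
Results and Analysis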
The results of the end-to-end LLM summarization pipeline were analyzed, and the trade-offs between the two approaches were examined. Detailed results, including the distribution of question themes summarized in different experiments, were gathered and summarized in a table.
# Distribution of question themes summarized in different experiments
import pandas as pd
# Create a DataFrame for detailed results
results = pd.DataFrame({
    'Experiment': ['Map-Reduce', 'k-Means Clustering'],
    'Theme 1': ['15%', '10%'],
    'Theme 2': ['12%', '7%'],
    # ... (other themes)
})
Additionally, a granular thematic breakdown of the top 10 questions related to loading, processing, and manipulating different types of data and documents was provided.
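Theme percentages like those in the table above can be derived by counting how many questions fall under each theme. The sketch below assumes each question has already been assigned a single theme label (for example, by an LLM or a reviewer). The function name `theme_distribution` and the labels in the usage line are hypothetical and are not results from the experiments.

```python
from collections import Counter
from typing import Dict, List


def theme_distribution(themes: List[str]) -> Dict[str, float]:
    """Return each theme's share of questions as a percentage."""
    counts = Counter(themes)
    total = sum(counts.values())
    # most_common() orders themes from most to least frequent.
    return {
        theme: round(100 * count / total, 1)
        for theme, count in counts.most_common()
    }


# Hypothetical labels, one per question, purely for illustration.
shares = theme_distribution(["loading", "loading", "agents", "memory"])
```

Running the same function on the labels produced by each experiment gives one row per experiment, which is the shape the results table uses.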
Conclusion
In conclusion, both approaches have their trade-offs. The map-reduce approach provides high customizability but comes with a higher cost, as indicated by token usage. On the other hand, clustering offers lower cost and may be a sensible way to quickly compress very large datasets. However, it risks information loss due to hand-tuning in the preprocessing stage. The thoughtful union of these two methods offers considerable promise for addressing the challenge of documentation improvement.
By open-sourcing the notebooks and the data, this analysis can be reproduced, and the community can benefit from the insights gained.
In summary, leveraging LLMs for documentation improvement can significantly enhance the navigability and user experience of the documentation. By carefully considering the trade-offs and experimenting with different approaches, we can continuously improve and refine the documentation to meet the evolving needs of the users.






