
LANGCHAIN — What Is TUNA and How Is It Used to Generate Synthetic Fine-Tuning Datasets Quickly?
Technological change is not additive; it is ecological. A new technology does not merely add something; it changes everything. — Neil Postman
TUNA, a no-code tool, allows for the rapid generation of fine-tuning datasets for large language models (LLMs) like GPT-3.5-turbo or LLaMa-2–7b. TUNA uses OpenAI’s GPT model to create prompt-completion pairs based on input text data. This article provides a detailed tutorial on using the TUNA web interface and Python script to generate synthetic fine-tuning datasets quickly.
Web Interface Tutorial
The TUNA web interface allows you to quickly generate prompt-completion pairs. After supplying your OpenAI key and a single column CSV file, TUNA requests prompt-completion pairs from GPT-3.5-turbo/GPT-4 for each text in the column. The interface provides three versions: SimpleQA, MultiChunk, and CustomPrompt, each suitable for different fine-tuning needs.
Python Script Tutorial
For larger datasets, the Python script offers a faster solution. It utilizes asyncio to handle more concurrent requests than the web interface. After setting the OpenAI key in the Repl.it Secrets page and uploading the CSV file, the script generates the output in a file named output.csv.
Sample Datasets and Fine-tuned Models
The author shares the results of fine-tuning LLaMa-7b using datasets generated by TUNA. The synthetic datasets Sassy-Aztec-qa-13k and Roman-Empire-qa-27k were created using TUNA and used for fine-tuning LLaMa-7b. The article demonstrates comparisons between the base model and the fine-tuned models on various text completion tasks.
Conclusion
The article concludes with an overview of LangSmith, a service to manage and convert fine-tuning datasets. It also encourages users to share their datasets on Hugging Face and provides links to integrate the fine-tuned models with LangChain.
Through TUNA, the author aims to simplify the process of generating fine-tuning datasets and contribute to the open source LLM community.
