How to Build Your Own AI Chatbot with Custom Data
Introduction
Creating an AI-powered chatbot tailored to your specific needs has never been more straightforward. This guide will walk you through the process of training an AI chatbot with custom data using OpenAI’s ChatGPT and Python. By the end of this tutorial, you’ll be equipped with the knowledge to set up, curate your data, and optimize your chatbot.

Prerequisites
Before we begin, ensure that you have the following prerequisites:
- Python 3 installed on your system (version 3.8 or later is recommended).
- Basic knowledge of Python programming.
- An OpenAI account to access the API key.
Step 1: Updating Pip
This guide assumes you already have Python installed on your machine. You can check by running the following command in your terminal:
python3 --version
If Python is not installed, you can download the installer from the official Python website. After the installation is complete, run the above command again to verify it. Python comes with pip pre-installed, but it’s advisable to update pip to the latest version if you’re using an older installation. You can update it using the following command:
python3 -m pip install -U pip
If pip is already up to date, the terminal will simply report that the requirement is already satisfied; if not, it will install the latest version. You can confirm the installed version by executing the following command:
pip --version
Step 2: Library Installation
Before we delve into the training process, we need to install some essential libraries. These libraries provide various functionalities that are crucial to our chatbot training process. Open the Terminal and execute the following commands one after the other:
pip3 install openai
pip3 install gpt_index
pip3 install PyPDF2
pip3 install gradio
pip3 install pycryptodome
pip3 install pypdf
pip3 install llama-index
Let’s understand what each of these libraries does:
- openai: The official OpenAI library that allows us to interact with the OpenAI API and use models like GPT-3.
- gpt_index: Also known as LlamaIndex, this library is used to connect the language model to the external data that serves as our knowledge base.
- PyPDF2: A Python library used to read and write PDF files. It's essential if you're planning to feed PDF files to the model.
- gradio: Gradio makes it easy to create a UI for your ML model, which is useful for interacting with our AI chatbot.
- pycryptodome: A self-contained Python package of low-level cryptographic primitives. It's an effective toolbox if you need to encrypt or decrypt data.
- pypdf: Similar to PyPDF2, pypdf is a library used for PDF document manipulation. It can extract text, metadata, and even images from PDFs.
- llama-index: The updated version of the gpt_index library. It's used for creating an index of documents that can be queried using natural language with the help of a language model.
These libraries form the backbone of our chatbot training process, providing the necessary tools and functionalities to create, train, and interact with our AI chatbot.
Step 3: Setting Up Payment Method and Acquiring OpenAI Key
Before we can start scripting, we need to obtain the API key from OpenAI. However, OpenAI requires users to set up a payment method before they can generate an API key. Here’s how you can do it:
- Visit the OpenAI website and log in to your account.
- Navigate to the ‘Billing’ section in your account settings.
- Here, you can add a payment method. OpenAI accepts most credit and debit cards. Enter your card details and save them.
- Once your payment method is set up and verified, you can now generate an API key.

To generate an API key:
- Navigate to the ‘API Keys’ section in your account settings.
- Click on “Create new secret key” to generate a key for our script.
- A dialog box will appear with your new key. Make sure to copy and securely store the key as you won’t be able to retrieve it again.

Now, you have successfully set up a payment method and acquired an API key from OpenAI. This key will be used in our script to authenticate our requests to the OpenAI API.
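Rather than hardcoding the key into your script (as the example below does for simplicity), a safer pattern is to read it from an environment variable so the key never lands in your source code or version control. The following is a minimal sketch; `OPENAI_API_KEY` is just the conventional variable name, and the helper function is my own, not part of any library:

```python
import os

def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Set the variable in your shell first, e.g.:
        export OPENAI_API_KEY="sk-..."
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running the script.")
    return key
```

In the script below you could then write `openai.api_key = load_api_key()` instead of pasting the key into the file.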
Step 4: Data Curation
Create a new directory named ‘MyData’ and populate it with PDF, TXT, or CSV files. You can add multiple files, but keep in mind that the more data you add, the more tokens will be consumed.
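Before indexing, you can get a rough sense of how many tokens your files will consume. A common rule of thumb is that one token is roughly four characters of English text. The sketch below applies that heuristic to the plain-text files in a directory; it's only an estimate, and PDFs would first need text extraction, which is omitted here:

```python
import os

def estimate_tokens(directory, chars_per_token=4):
    """Very rough token estimate: total characters / ~4 chars per token."""
    total_chars = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Only count plain-text formats; PDFs need extraction first.
        if name.lower().endswith((".txt", ".csv")) and os.path.isfile(path):
            with open(path, encoding="utf-8", errors="ignore") as f:
                total_chars += len(f.read())
    return total_chars // chars_per_token
```

Running `estimate_tokens("MyData")` before indexing gives you a ballpark figure to compare against your budget.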

Step 5: Script Creation
With all the prerequisites in place, our next step is to create a Python script to train the chatbot with custom data. This script will read the files inside the ‘MyData’ directory and build a queryable index from them. Create the Python file in the same folder as your ‘MyData’ folder; I called mine ‘app.py’.

Here is the Python script:
import os

import openai
import gradio as gr
from langchain import OpenAI
from llama_index import SimpleDirectoryReader, GPTVectorStoreIndex, LLMPredictor, ServiceContext

API_KEY = 'your-api-key'
DIRECTORY_PATH = "MyData"
model_name = "gpt-3.5-turbo"

openai.api_key = API_KEY

def construct_index(directory_path):
    num_outputs = 512
    # Configure the language model that will answer queries over the documents.
    llm_predictor = LLMPredictor(llm=OpenAI(openai_api_key=API_KEY, temperature=0.7, model_name=model_name, max_tokens=num_outputs))
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
    # Load every supported file in the directory and build a vector index from it.
    docs = SimpleDirectoryReader(directory_path).load_data()
    index = GPTVectorStoreIndex.from_documents(docs, service_context=service_context)
    return index

index = construct_index(DIRECTORY_PATH)

def chatbot(input_text):
    query_engine = index.as_query_engine()
    response = query_engine.query(input_text)
    return response.response

iface = gr.Interface(fn=chatbot,
                     inputs=gr.Textbox(lines=7, label="Enter your text"),
                     outputs="text",
                     title="Custom-trained AI Chatbot")
iface.launch(share=True)
Replace 'your-api-key' with your actual OpenAI key, which you copied from the OpenAI website in Step 3.
Choosing the Right Model for Your Use Case
OpenAI offers a diverse set of models, each with different capabilities, price points, and use cases. You can learn more about the available models in the models section of OpenAI’s official documentation.
Simply replace model_name = "gpt-3.5-turbo" with your choice of model.
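Whichever model you pick also determines the maximum number of tokens a single call can process (prompt plus completion). The sketch below illustrates the idea; the context sizes are those of the models available when this was written and may have changed since, so treat them as placeholders:

```python
# Approximate context windows (prompt + completion), at the time of writing.
CONTEXT_WINDOW = {
    "gpt-3.5-turbo": 4096,
    "gpt-3.5-turbo-16k": 16384,
    "gpt-4": 8192,
}

def fits_in_context(model, prompt_tokens, max_output_tokens=512):
    """Check whether the prompt plus the reserved output budget fits the model."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW[model]
```

For example, with num_outputs = 512 as in our script, a prompt of 4,000 tokens would already overflow gpt-3.5-turbo's window.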
Step 6: Bringing the Chatbot to Life
With everything set up, we can now run the script and bring our chatbot to life. Navigate to the location where you have the ‘app.py’ and ‘MyData’ directory. Open Terminal and run the following command:
python3 app.py
This command builds the index over your custom data — the “training” step. The duration of this process depends on the volume of data you have provided. Once indexing is complete, a public link will be generated where you can test the chatbot’s responses using a simple UI.

You can open this link in any browser and start interacting with your custom-trained chatbot. Remember, asking questions and training the chatbot consumes tokens from your OpenAI account.

I trained the bot with my personal information.
To train the chatbot with more or different data, you can terminate the program using CTRL + C, modify the files in the ‘MyData’ directory, and then run the Python file again. Enjoy the process of building and testing your custom AI chatbot!
Step 7: Tracking Your Token Usage
As you train and interact with your AI chatbot, it’s important to keep track of your token usage. Tokens are the units of text that language models read. In English, a token can be as short as one character or as long as one word. For example, “ChatGPT is great!” is encoded into six tokens: [“Chat”, “G”, “PT”, “ is”, “ great”, “!”].
OpenAI charges users based on the number of tokens processed, which includes both the input and output tokens. Therefore, monitoring your token usage can help you manage your costs effectively.
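To translate tokens into dollars, you can apply the per-token rates from OpenAI's pricing page. The rates in the sketch below are only illustrative defaults (gpt-3.5-turbo's prices at the time of writing, per 1,000 tokens); check the current pricing page and pass in the real numbers:

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_1k=0.0015, output_price_per_1k=0.002):
    """Illustrative cost estimate in dollars.

    Prices are per 1,000 tokens and default to assumed gpt-3.5-turbo rates;
    override them with the current values from OpenAI's pricing page.
    """
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k
```

For example, a query that sends 1,000 tokens and receives 1,000 tokens would cost about a third of a cent at these rates.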
Here’s how you can track your token usage:
OpenAI provides a dashboard where you can monitor your API usage. You can access it by logging into your OpenAI account and navigating to the ‘Usage’ section. Here, you’ll find a detailed breakdown of your token usage, including the number of tokens used per request and the total cost.

Remember, the number of tokens affects not only the cost but also whether the API call works at all, as there is a maximum limit to the number of tokens that can be processed in a single API call. By keeping track of your token usage, you can manage your costs and ensure that your API calls run successfully.
Bonus:
You can use Streamlit instead of Gradio for your user interface. The indexing logic stays exactly the same; only the UI code changes. Replace the above Python program with this:
import os

import openai
import streamlit as st
from langchain.chat_models import ChatOpenAI
from llama_index import SimpleDirectoryReader, GPTVectorStoreIndex, LLMPredictor, ServiceContext

API_KEY = 'your-api-key'
DIRECTORY_PATH = "MyData"
model_name = "gpt-3.5-turbo"

openai.api_key = API_KEY

# Streamlit re-runs the whole script on every interaction, so cache the index
# to avoid rebuilding it (and re-spending tokens) on each question.
@st.cache_resource
def construct_index(directory_path):
    num_outputs = 512
    llm_predictor = LLMPredictor(llm=ChatOpenAI(openai_api_key=API_KEY, temperature=0.7, model_name=model_name, max_tokens=num_outputs))
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
    docs = SimpleDirectoryReader(directory_path).load_data()
    return GPTVectorStoreIndex.from_documents(docs, service_context=service_context)

index = construct_index(DIRECTORY_PATH)

st.title("Custom-trained AI Chatbot")
input_text = st.text_area("Enter your text", height=200)
if st.button("Ask"):
    query_engine = index.as_query_engine()
    response = query_engine.query(input_text)
    st.write(response.response)
To run this program, simply execute this command in your terminal:
streamlit run app.py
It will automatically open the application in your browser:

Alternatively, you can always write your own frontend application using HTML, CSS, and Flask, as I did:

Thanks for Reading!
If you like my work and want to support me…