Yanli Liu

How I Created Easy Gen AI Demos for Non-Technical Colleagues and Users

An Easy Guide to Quick Demos With Few Lines of Code Using Gradio

In this post, we’ll see how to write quick machine learning and Gen AI demos with a few lines of code using Gradio.

Specifically, you’ll learn how to:

  1. Write an interactive chatbot powered by OpenAI GPT-3.5.
  2. Stream the response and add memory to your chatbot to improve the user experience.
  3. Write a chatbot with a customized look to chat with your fine-tuned Mistral 7B, which has been trained on financial knowledge data (see the following article).

Image generated by the author using Bing Chat, powered by DALL-E 3

I have been working in financial services for years, surrounded by colleagues with economic and risk management backgrounds. Some of them are also really good at numbers and models (risk models, credit models, but not Large Language Models).

When my proposal for a side project on leveraging LLMs to improve the team’s productivity was accepted, one of the challenges I faced was how to communicate complex machine learning concepts to colleagues who are not IT specialists.

One solution I found was to create interactive demos.

Demos are a great way to show people how machine learning works in practice and to let them try it out in their own browsers.

Demo by the author

However, creating demos can be time-consuming and challenging, especially if you’re not a web developer.

That’s where Gradio comes in.

What Is Gradio?

Gradio is an open-source Python library that is used to build machine learning and data science demos and web applications.

With Gradio, you can quickly create a beautiful user interface around your machine learning models and let people try them out and interact with your demo, all through the browser.

Gradio is useful for:

  • Demoing your machine learning models for clients, collaborators, users, or students.
  • Deploying your models quickly with automatic shareable links and getting feedback on model performance.
  • Debugging your model interactively during development using built-in manipulation and interpretation tools.
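To make the idea concrete, here is a minimal sketch of wrapping a plain Python function in a Gradio UI. The `greet` function is a hypothetical stand-in for a model, and `launch_demo` is a helper name chosen for this illustration; Gradio is imported inside the helper only so that `greet` can be tried without it installed.

```python
def greet(name: str) -> str:
    """Toy stand-in for a model: any Python function can be wrapped."""
    return f"Hello, {name}!"


def launch_demo() -> None:
    """Wrap greet in a text-in, text-out web UI and serve it locally."""
    import gradio as gr  # lazy import: only needed when the UI is launched

    gr.Interface(fn=greet, inputs="text", outputs="text").launch()
```

Calling `launch_demo()` in a notebook or script should open the interface in the browser; swapping `greet` for a real prediction function is the only change a model demo needs.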

Chatbot Powered By OpenAI GPT-3.5

You can find the accompanying Colab notebook here. The code snippets shown here are largely inspired by the Gradio documentation.

1. Install necessary packages

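!pip install -q gradio
# The code below uses openai.ChatCompletion, which was removed in openai 1.0, so pin an older release
!pip install "openai<1"
!pip install tiktoken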

2. Get your OpenAI API key

You’ll need an OpenAI API key to get access to OpenAI’s language models. To obtain your key, visit the OpenAI developer portal, sign up, and retrieve your API key.

OPENAI_API_KEY = "your_api_key"
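If you would rather not paste the key into the notebook, one option is to read it from an environment variable. This is a minimal sketch: the helper name `load_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is an assumption chosen to match the code in this post.

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ, name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored under `name`, or fail with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before starting the demo.")
    return key
```

With the variable set, `OPENAI_API_KEY = load_api_key()` replaces the hard-coded assignment, and a missing key fails early instead of at the first API call.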

3. Define your chat function

When working with Gradio’s ChatInterface API, the first thing to do is define the chat function. The chat function should take two arguments: message and then history (the arguments can be named anything, but must be in this order).

  • message: a str representing the user’s input.
  • history: a list of lists representing the conversation up to that point. Each inner list consists of two str representing a pair: [user input, bot response].

This function should return a single string response, which is the bot’s response to the particular user input message.
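Before wiring in a real model, the contract can be tried with a toy function. This is a sketch with hypothetical names (`echo_chat`, `launch_chat`); it calls no model and only shows the shape of the two arguments and the returned string.

```python
def echo_chat(message: str, history: list) -> str:
    """Toy chat function: no model, just the (message, history) -> str contract."""
    turn = len(history) + 1  # history holds one [user, bot] pair per finished turn
    return f"Turn {turn}: you said {message!r}"


def launch_chat() -> None:
    """Serve echo_chat through Gradio's chat UI."""
    import gradio as gr  # lazy import: only needed when the UI is launched

    gr.ChatInterface(echo_chat).launch()
```

Because the function is plain Python, it can be called directly, for example `echo_chat("hi", [])`, which makes the chat logic easy to check before any UI is involved.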

In short, the chat function sends the user’s query and the previous conversation history to GPT-3.5 Turbo and returns the model response.

Passing the chat history to the model is important because each call to GPT-3.5 Turbo is standalone: the model only sees what is in the request, so without the history the user could not ask follow-up questions.

# Import the necessary libraries
import openai  # Import OpenAI library for making API requests
import gradio as gr  # Import Gradio for creating a user interface

# Set the OpenAI API key - Replace OPENAI_API_KEY with your actual API key
openai.api_key = OPENAI_API_KEY

# Define a function called get_completion
def get_completion(message, history):
    history_openai_format = []

    # Iterate through the conversation history (a list of tuples with human and assistant messages)
    for human, assistant in history:
        # Add the user's message to the formatted history with the role "user"
        history_openai_format.append({"role": "user", "content": human })

        # Add the assistant's response to the formatted history with the role "assistant"
        history_openai_format.append({"role": "assistant", "content": assistant})

    # Add the current user's message to the formatted history
    history_openai_format.append({"role": "user", "content": message})

    # Make an API request to OpenAI's ChatCompletion model
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',  # Specify the model to use
        messages=history_openai_format,  # Provide the formatted conversation history
        temperature=0,  # Set temperature to 0 for more focused and deterministic responses
    )

    # Extract and return the content of the model's response
    return response.choices[0].message["content"]
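Since the full history is sent with every request, a long conversation will eventually exceed the model’s context window. A simple way to bound it is to keep only the most recent turns. The sketch below makes that assumption explicit: `to_openai_messages` and its `max_turns` parameter are hypothetical, the default of 5 is arbitrary, and it counts turns rather than tokens.

```python
def to_openai_messages(message, history, max_turns=5):
    """Convert Gradio-style history to chat messages, keeping only recent turns."""
    # history[-0:] would return the whole list, so handle max_turns == 0 explicitly
    recent = history[-max_turns:] if max_turns > 0 else []
    messages = []
    for human, assistant in recent:
        messages.append({"role": "user", "content": human})
        messages.append({"role": "assistant", "content": assistant})
    # The current user message always goes last
    messages.append({"role": "user", "content": message})
    return messages
```

The returned list has the same shape as `history_openai_format` above, so it could be passed as `messages=` in its place. The trade-off is that the model forgets anything older than `max_turns` exchanges.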

4. Create the chatbot with only one line of code!

gr.ChatInterface(get_completion).queue().launch()

This single line of code creates a chat interface that looks like this.

5. Add Streaming And Memory to Your Chatbot

Now, let’s further improve the user experience of the chatbot above by streaming the model’s responses. Here’s the code to achieve that:

def get_completion_with_streaming(message, history):
    history_openai_format = []

    # Iterate through the conversation history (a list of tuples with human and assistant messages)
    for human, assistant in history:
        # Add the user's message to the formatted history with the role "user"
        history_openai_format.append({"role": "user", "content": human })

        # Add the assistant's response to the formatted history with the role "assistant"
        history_openai_format.append({"role": "assistant", "content": assistant})

    # Add the current user's message to the formatted history
    history_openai_format.append({"role": "user", "content": message})

    # Make an API request to OpenAI's ChatCompletion model with streaming enabled
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',  # Specify the model to use
        messages=history_openai_format,  # Provide the formatted conversation history
        temperature=1.0,  # Set temperature to control the randomness of responses
        stream=True  # Enable streaming mode for partial responses
    )

    # Initialize a variable to hold the partial message
    partial_message = ""

    # Iterate through the response chunks
    for chunk in response:
        # Some chunks carry no text (for example, only a role), so read "content" defensively
        content = chunk['choices'][0]['delta'].get('content')
        if content:
            # Append the content of the chunk to the partial message
            partial_message = partial_message + content

            # Yield the partial message, allowing for streaming responses
            yield partial_message

With streaming, the user doesn’t have to wait as long for a message to be generated.

Streaming the LLM output to enhance the user experience
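The key idea is that the chat function becomes a generator that yields the whole message so far, not just the newest fragment. The sketch below isolates that pattern without calling the API. The chunk dictionaries are hand-written and only assumed to resemble the streaming payload, and `accumulate_stream` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def accumulate_stream(chunks: Iterable[dict]) -> Iterator[str]:
    """Yield the growing message after each chunk that carries text."""
    partial_message = ""
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        text = delta.get("content")
        if text:  # skip role-only and empty chunks
            partial_message += text
            yield partial_message


# Hand-written chunks shaped like a streaming response (assumed shape).
fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hel"}}]},
    {"choices": [{"delta": {"content": "lo"}}]},
    {"choices": [{"delta": {}}]},
]
```

Iterating over `accumulate_stream(fake_chunks)` produces each intermediate version of the reply in turn, which is what Gradio re-renders in the chat window as the answer grows.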

Chatting With Mistral 7B

In this example, we’ll load and run the fine-tuned Mistral 7B on a Google Colab instance using the Transformers library.

Even with quantization to reduce memory usage, the model is still too large to run on a free Colab instance, so you’ll need a Pro account. We won’t go through how to load and run the model itself in this post, but you can find the accompanying Colab notebook here: https://colab.research.google.com/drive/1_iDmZfqp6dtdxr1ZhMMZ7W6JQpW27Hn6?authuser=0#scrollTo=UpX5puR7y_RJ

Alternatively, if you want to run the model locally, you can check out text-generation-inference (https://github.com/huggingface/text-generation-inference).

1. Define your chat function

The chat function takes the user’s query and the conversation history, sends them to the model to generate a response, and yields the generated response in a streaming manner.
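Unlike the OpenAI API, the model here takes a single prompt string, so the history has to be flattened first. The sketch below isolates that step, assuming the `<human>` and `<bot>` tags used in this post; `build_prompt` is a hypothetical helper name.

```python
def build_prompt(message: str, history: list) -> str:
    """Flatten [user, bot] pairs plus the new message into one prompt string."""
    # The new message is added as a turn with an empty bot reply for the model to complete
    turns = history + [[message, ""]]
    return "".join(f"\n<human>:{user}\n<bot>:{bot}" for user, bot in turns)
```

The prompt ends with an open `<bot>:` tag, which is what cues the model to write the next reply.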

# Import necessary libraries and modules
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
from threading import Thread

# `model` and `tokenizer` are assumed to be already loaded (see the Colab notebook)
# Move the model to the GPU (cuda:0)
model = model.to('cuda:0')

# Define a custom StoppingCriteria class for text generation
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Define stop tokens (e.g., [29, 0]) that determine when to stop text generation
        stop_ids = [29, 0]
        for stop_id in stop_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

# Define a function called predict for text generation
def predict(message, history):
    # Combine the user's message and conversation history
    history_transformer_format = history + [[message, ""]]

    # Create an instance of the custom StoppingCriteria class
    stop = StopOnTokens()

    # Prepare the conversation history in a specific format
    messages = "".join(["".join(["\n<human>:"+item[0], "\n<bot>:"+item[1]])
                for item in history_transformer_format])

    # Tokenize the formatted conversation history and move it to the GPU
    model_inputs = tokenizer([messages], return_tensors="pt").to("cuda")

    # Create a TextIteratorStreamer to iterate over generated tokens
    streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)

    # Define text generation parameters
    generate_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=1024,
        do_sample=True,
        top_p=0.95,
        top_k=1000,
        temperature=1.0,
        num_beams=1,
        stopping_criteria=StoppingCriteriaList([stop])
    )

    # Start text generation in a separate thread
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

    partial_message = ""
    
    # Iterate over generated tokens and yield partial messages
    for new_token in streamer:
        if new_token != '<':
            partial_message += new_token
            yield partial_message
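The reason for the separate thread is that generation blocks until it finishes, while the chat function needs to yield text as soon as it is available. The sketch below shows that producer and consumer pattern with only the standard library. It is a simplified stand-in, not how Transformers implements its streamer: `fake_generate` plays the role of `model.generate`, and a plain queue plays the role of the streamer.

```python
import queue
import threading
import time
from typing import Iterator

_DONE = object()  # sentinel marking the end of generation


def fake_generate(tokens, out: queue.Queue) -> None:
    """Stand-in for model.generate: emits tokens slowly, then signals completion."""
    for token in tokens:
        time.sleep(0.01)  # pretend each token takes time to compute
        out.put(token)
    out.put(_DONE)


def stream_reply(tokens) -> Iterator[str]:
    """Run the producer in a background thread and yield growing partial text."""
    out = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(tokens, out))
    worker.start()

    partial_message = ""
    while True:
        item = out.get(timeout=10.0)  # same idea as the streamer's timeout=10.
        if item is _DONE:
            break
        partial_message += item
        yield partial_message
    worker.join()
```

The consumer loop never waits for the whole reply: each `out.get` returns as soon as one token is ready, so the UI can update while the producer is still working.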

2. Try it out with Gradio one line magic!

gr.ChatInterface(predict).queue().launch()

This code launches the Gradio interface so that we can chat with our Mistral 7B!

3. Improve the Chatbot UI with customization

Now that we’re familiar with Gradio’s ChatInterface, we can further customize the look and feel of the chatbot. For example, we can add a title and description above the chatbot, and show examples to make it easier for users to try it out.

gr.ChatInterface(
    predict,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Send a message", container=False, scale=7),
    title="Chat with Finance Mistral 7B",
    description="Ask me any questions on finance",
    theme="soft",
    examples=["Will capital gains affect my tax bracket?", "What are the common income tax deductions used by rich salaried households?"],
    cache_examples=True,
    retry_btn=None,
    undo_btn="Delete Previous",
    clear_btn="Clear",
).queue().launch()

This code snippet gives the chatbot a look like this.

Closing Thoughts

In this article, we have created a chatbot powered by OpenAI, built a demo to chat with the fine-tuned Mistral 7B model, and shown how to stream model output to improve the user experience.

Gradio makes it really easy for anyone to create machine learning demos with just a few lines of code. So, even if you’re not a front-end developer, there’s no excuse not to go the extra mile and create a great UI for your demo. It will make a big difference!

Alternatively, you can use Gradio Blocks if you want more control and customization.

Teaser: Gradio 4 is coming soon, on October 31st. Gradio 4 is the next major release and will let you do MUCH more with your machine learning apps, so stay tuned!

Before you go! 🦸🏻‍♀️

If you liked my story and you want to support me:

  1. Clap my article 50 times, that will really really help me out.👏
  2. Follow me on Medium and subscribe to get my latest article🫶

References

  1. How to Create a Chatbot with Gradio: https://www.gradio.app/guides/creating-a-chatbot-fast
  2. Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild: https://arxiv.org/abs/1906.02569