Summary
The web content provides a comprehensive guide on integrating Google's Gemini 1.5 Pro API into a chat web application to enable multi-modal interactions, including text, images, video, and audio, and compares its capabilities and costs with OpenAI's GPT-4o.
Abstract
The provided web content delves into the integration of Google's Gemini 1.5 Pro API with an existing chat web application, detailing the steps required to leverage its advanced multi-modal capabilities. It begins by introducing the Gemini 1.5 models, highlighting their superior context window lengths and native handling of various data modalities compared to OpenAI's GPT-4o. The article then guides the reader through setting up the Gemini API, including obtaining API keys, configuring Python environments, and experimenting with text, image, and video interactions. It also demonstrates how to maintain chat history and integrate the Gemini models with the OpenAI models within the Streamlit app. The content concludes with instructions on deploying the updated Streamlit OmniChat app online for free, emphasizing the potential issues with video file handling and encouraging reader engagement through applause and subscription prompts.
Opinions
Although OpenAI tried to eclipse it by announcing GPT-4o the day before, on May 14th, 2024, Google announced an update of the Gemini Flash and Pro models during the keynote of Google I/O ’24.
These new models really compete with OpenAI as they are better in certain areas:

Even so, Gemini 1.5 Pro is slightly worse than GPT-4 Turbo and GPT-4o in terms of general accuracy, information quality, code generation, and other tasks.
When compared to the previous generation of Gemini models:
Gemini 1.5 Pro achieves comparable quality to Gemini 1.0 Ultra, while using less compute (article)
What’s more, Gemini 1.0 Ultra, which was the first model to outperform human experts on MMLU (Massive Multitask Language Understanding), had a context window of only 32K tokens. This is 30–60 times less context than the new models offer. One might think that enlarging the context window is not that difficult, and that increasing the size of some of the LLM's internal vectors would be enough. The problem is that this not only scales quadratically in terms of compute cost, but it also often causes the model to lose track of parts of the input context when it has to predict a response based on them. This problem is measured with the “Needle In A Haystack” (NIAH) evaluation, where a small piece of text containing a particular fact or statement is purposely placed within a long block of text. In this test, Gemini 1.5 Pro found the embedded text 99% of the time, in blocks of data as long as 1 million tokens.
One of the key components making the 1.5 models successful is a new, undisclosed architecture combining MoE (Mixture of Experts) and Transformers, which allows them to be better at a wider range of tasks while being smaller and faster.
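To make the NIAH setup more concrete, here is a minimal sketch of how such a test prompt could be built. The helper names (build_haystack, needle_found), the filler sentence, and the needle are all made up for illustration; real NIAH benchmarks use long natural documents and measure length in tokens rather than sentences.

```python
def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Repeat `filler` n_sentences times and insert `needle` at a relative depth (0.0 to 1.0)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_sentences
    position = round(depth * n_sentences)  # 0 = very beginning, n_sentences = very end
    sentences.insert(position, needle)
    return " ".join(sentences)


def needle_found(model_answer: str, expected: str) -> bool:
    """Very loose scoring (an assumption): case-insensitive substring match."""
    return expected.lower() in model_answer.lower()


# Hypothetical needle and filler, only for illustration
needle = "The secret code for the vault is 7421."
prompt = build_haystack("The sky was grey that day.", needle, n_sentences=1000, depth=0.5)
question = "What is the secret code for the vault?"

# In a real test, prompt + question would be sent to the model and
# the answer would be scored with needle_found(answer, "7421").
print(len(prompt), "characters in the haystack prompt")
```

Repeating this for several depths and haystack lengths gives the usual NIAH grid of retrieval rates.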
In order to use the Gemini API, we will need to get our API token first. You can get it from the Google AI Studio website:

For now, the API is free to use within some rate limits, but if you are in Europe, the UK or Switzerland you will already need to create a billing account and pay for it (I had to do it, and so far I have spent less than 10 cents on many different experiments with text, images, and videos, so it’s cheap 😛).
Then create a folder for your project if you haven’t done it yet, open your favorite IDE (I will use VSCode), and create a .env file where we will place our API Key as an environment variable (if you saw my previous blog on how to use the OpenAI API, you can just add this new key next to the other ones):
# /.env
GOOGLE_API_KEY=<your-api-key>
We will need to install the following Python libraries from the terminal:
# You can create and activate a virtual environment first if you want
pip install python-dotenv google-generativeai iprogress ipykernel
Now we will create or open our api_experiments.ipynb file and introduce the following code:
import os
import google.generativeai as genai
import dotenv

dotenv.load_dotenv()
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# Set up the model
generation_config = {
    "temperature": 0.2,
    "top_p": 0.8,
    "top_k": 64,
    "max_output_tokens": 8192,
}
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    generation_config=generation_config,
)

response = model.generate_content("What is the chemical formula of glucose?")
try:
    print(response.text)
except Exception as e:
    print("Exception:\n", e, "\n")
    print("Response:\n", response.candidates)

The chemical formula of glucose is **C₆H₁₂O₆**.
Querying the Gemini API with text is as simple as that. I had to add the try-except block because Google has implemented quite a strict filter, and many basic questions can trigger it and return a response without text, only with some basic flags that help to understand why the request failed.
We can change the model configuration, and we can also switch from Gemini 1.5 Flash to Pro by setting model_name="gemini-1.5-pro":
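If you don't want to repeat the try-except block in every cell, it can be wrapped in a small helper. This is only a sketch: safe_text is a hypothetical name, and it assumes, as in the cell above, that reading response.text raises an exception when the response was filtered and that each candidate may expose a finish_reason field. FakeBlockedResponse is a stand-in used here so the example runs without an API key.

```python
from types import SimpleNamespace


def safe_text(response, fallback_prefix: str = "[no text returned] ") -> str:
    """Return response.text, or a short explanation if the text is not available."""
    try:
        return response.text
    except Exception:
        # Collect whatever finish reasons the candidates expose (assumed field name)
        candidates = getattr(response, "candidates", None) or []
        reasons = [str(getattr(c, "finish_reason", "unknown")) for c in candidates]
        details = ", ".join(reasons) or "no candidates returned"
        return f"{fallback_prefix}finish reasons: {details}"


class FakeBlockedResponse:
    """Stand-in for a filtered response: accessing .text raises, as described above."""
    candidates = [SimpleNamespace(finish_reason="SAFETY")]

    @property
    def text(self):
        raise ValueError("response contains no valid text part")


print(safe_text(FakeBlockedResponse()))
```

With the real API, the call would simply be safe_text(model.generate_content("...")).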
model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    generation_config=generation_config,
)
Now let’s try to chat with an image. I have added the following image to my project’s folder as fridge_food.jpg:

And in a new code cell of our api_experiments.ipynb we will check whether the model can list the food that we have:
generation_config = {
    "temperature": 0.2,
    "top_p": 0.8,
    "top_k": 64,
    "max_output_tokens": 8192,
}
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    generation_config=generation_config,
)

prompt_parts = [
    genai.upload_file("fridge_food.jpg"),
    "List the food items in the fridge and their quantities."
]
response = model.generate_content(prompt_parts)
try:
    print(response.text)
except Exception as e:
    print("Exception:\n", e, "\n")
    print("Response:\n", response.candidates)
Nice, almost perfect! Gemini 1.5 Flash got all the ingredients and counted almost all of them correctly. Let’s try it with 1.5 Pro:

The result is now even better! I would say it is perfect, especially taking into account that some fruits are occluded and have different colors and shapes.
So we can chat with image files, but also with video, audio, and document files using the same method: genai.upload_file("<file-path>")
If we want to have a longer conversation, we can add the chat history context in two different ways:
model = genai.GenerativeModel("gemini-1.5-flash")
chat_history = [
    {
        "role": "user",
        "parts": ["Hi!"]
    },
    {
        "role": "model",
        "parts": ["Hi there! How can I help you today?"],
    },
    {
        "role": "user",
        "parts": ["Translate 'Large Language Models are awesome!' to French."],
    }
]
response = model.generate_content(chat_history)
try:
    print(response.text)
except Exception as e:
    print("Exception:\n", e, "\n")
    print("Response:\n", response.candidates)

“Les grands modèles de langage sont géniaux !”
Here, parts could also include files in its list, as we saw before, and each model response can be appended to the chat history with "role": "model", so that we can keep asking questions while preserving the conversation context.
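As a rough sketch of that iterative pattern, the history can be grown turn by turn like this. The helpers add_turn and ask are hypothetical names, and fake_generate is a stand-in for the real model call so that the example runs without an API key.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]


def add_turn(history: List[Message], role: str, text: str) -> None:
    """Append one message in the {"role", "parts"} format used above."""
    if role not in ("user", "model"):
        raise ValueError("role must be 'user' or 'model'")
    history.append({"role": role, "parts": [text]})


def ask(history: List[Message], question: str,
        generate: Callable[[List[Message]], str]) -> str:
    """Add the user question, call the model on the full history, and store its answer."""
    add_turn(history, "user", question)
    answer = generate(history)
    add_turn(history, "model", answer)
    return answer


def fake_generate(history: List[Message]) -> str:
    # Stand-in for the real model call (assumption, see the note below)
    return f"(reply to: {history[-1]['parts'][0]})"


history: List[Message] = []
ask(history, "Hi!", fake_generate)
ask(history, "Translate 'Large Language Models are awesome!' to French.", fake_generate)
for message in history:
    print(message["role"], "->", message["parts"][0])
```

With the real API, generate could be something like lambda h: model.generate_content(h).text, using the model object created in the previous cell.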
model = genai.GenerativeModel("gemini-1.5-flash", generation_config={"temperature": 0.3})
chat = model.start_chat(history=[])
prompt_parts = ["My favourite food is pizza."]
response = chat.send_message(prompt_parts)
print(response.text)

That’s awesome! Pizza is a classic for a reason. What’s your favorite kind of pizza? 🍕

prompt_parts = ["What is my favourite food?"]
response = chat.send_message(prompt_parts)
print(response.text)

As an AI, I don’t have access to your personal information, including your favorite food. You told me your favorite food is pizza, but is that still true? 😊
So we see that it keeps track of the chat conversation internally. We can check it like this if we want:
chat.history

In the following section, we will use the first way to integrate Gemini 1.5 with the OpenAI models that we already had in the OmniChat Streamlit app 💪 Which will be the best one overall? 🤔
Now we will see, part by part, how the app.py file from my previous blog/video is modified in order to integrate Gemini. I will go chunk by chunk, commenting on the changes; you will need to put all of them together with proper indentation in order to make it work.
import streamlit as st
from openai import OpenAI
import google.generativeai as genai
import dotenv
import os
from PIL import Image
from audio_recorder_streamlit import audio_recorder
import base64
from io import BytesIO
import random

dotenv.load_dotenv()

google_models = [
    "gemini-1.5-flash",
    "gemini-1.5-pro",
]
openai_models = [
    "gpt-4o",
    "gpt-4-turbo",
    "gpt-3.5-turbo-16k",
    "gpt-4",
    "gpt-4-32k",
]

# Function to convert the messages format from OpenAI and Streamlit to Gemini
def messages_to_gemini(messages):
    gemini_messages = []
    prev_role = None
    for message in messages:
        if prev_role and (prev_role == message["role"]):
            gemini_message = gemini_messages[-1]
        else:
            gemini_message = {
                "role": "model" if message["role"] == "assistant" else "user",
                "parts": [],
            }

        for content in message["content"]:
            if content["type"] == "text":
                gemini_message["parts"].append(content["text"])
            elif content["type"] == "image_url":
                gemini_message["parts"].append(base64_to_image(content["image_url"]["url"]))
            elif content["type"] == "video_file":
                gemini_message["parts"].append(genai.upload_file(content["video_file"]))
            elif content["type"] == "audio_file":
                gemini_message["parts"].append(genai.upload_file(content["audio_file"]))

        if prev_role != message["role"]:
            gemini_messages.append(gemini_message)

        prev_role = message["role"]

    return gemini_messages

# Function to query and stream the response from the LLM
def stream_llm_response(model_params, model_type="openai", api_key=None):
    response_message = ""

    if model_type == "openai":
        client = OpenAI(api_key=api_key)
        for chunk in client.chat.completions.create(
            model=model_params["model"] if "model" in model_params else "gpt-4o",
            messages=st.session_state.messages,
            temperature=model_params["temperature"] if "temperature" in model_params else 0.3,
            max_tokens=4096,
            stream=True,
        ):
            chunk_text = chunk.choices[0].delta.content or ""
            response_message += chunk_text
            yield chunk_text

    elif model_type == "google":
        genai.configure(api_key=api_key)
        model = genai.GenerativeModel(
            model_name = model_params["model"],
            generation_config={
                "temperature": model_params["temperature"] if "temperature" in model_params else 0.3,
            }
        )
        gemini_messages = messages_to_gemini(st.session_state.messages)
        print("st_messages:", st.session_state.messages)
        print("gemini_messages:", gemini_messages)
        for chunk in model.generate_content(gemini_messages):
            chunk_text = chunk.text or ""
            response_message += chunk_text
            yield chunk_text

    st.session_state.messages.append({
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": response_message,
            }
        ]})

# Function to convert file to base64
def get_image_base64(image_raw):
    buffered = BytesIO()
    image_raw.save(buffered, format=image_raw.format)
    img_byte = buffered.getvalue()
    return base64.b64encode(img_byte).decode('utf-8')

def file_to_base64(file):
    with open(file, "rb") as f:
        return base64.b64encode(f.read())

def base64_to_image(base64_string):
    base64_string = base64_string.split(",")[1]
    return Image.open(BytesIO(base64.b64decode(base64_string)))

def main():
    # --- Page Config ---
    st.set_page_config(
        page_title="The OmniChat",
        page_icon="🤖",
        layout="centered",
        initial_sidebar_state="expanded",
    )

    # --- Header ---
    st.html("""<h1 style="text-align: center; color: #6ca395;">🤖 <i>The OmniChat</i> 💬</h1>""")

    # --- Side Bar ---
    with st.sidebar:
        cols_keys = st.columns(2)
        with cols_keys[0]:
            default_openai_api_key = os.getenv("OPENAI_API_KEY") if os.getenv("OPENAI_API_KEY") is not None else ""  # only for development environment, otherwise it should return None
            with st.popover("🔐 OpenAI"):
                openai_api_key = st.text_input("Introduce your OpenAI API Key (https://platform.openai.com/)", value=default_openai_api_key, type="password")
        with cols_keys[1]:
            default_google_api_key = os.getenv("GOOGLE_API_KEY") if os.getenv("GOOGLE_API_KEY") is not None else ""  # only for development environment, otherwise it should return None
            with st.popover("🔐 Google"):
                google_api_key = st.text_input("Introduce your Google API Key (https://aistudio.google.com/app/apikey)", value=default_google_api_key, type="password")

    # --- Main Content ---
    # Checking if the user has introduced the OpenAI API Key, if not, a warning is displayed
    if (openai_api_key == "" or openai_api_key is None or "sk-" not in openai_api_key) and (google_api_key == "" or google_api_key is None):
        st.write("#")
        st.warning("⬅️ Please introduce an API Key to continue...")

    else:
        client = OpenAI(api_key=openai_api_key)

        if "messages" not in st.session_state:
            st.session_state.messages = []

        # Displaying the previous messages if there are any
        for message in st.session_state.messages:
            with st.chat_message(message["role"]):
                for content in message["content"]:
                    if content["type"] == "text":
                        st.write(content["text"])
                    elif content["type"] == "image_url":
                        st.image(content["image_url"]["url"])
                    elif content["type"] == "video_file":
                        st.video(content["video_file"])
                    elif content["type"] == "audio_file":
                        st.audio(content["audio_file"])

        # Side bar model options and inputs
        with st.sidebar:
            st.divider()

            available_models = [] + (google_models if google_api_key else []) + (openai_models if openai_api_key else [])
            model = st.selectbox("Select a model:", available_models, index=0)
            model_type = None
            if model.startswith("gpt"): model_type = "openai"
            elif model.startswith("gemini"): model_type = "google"

            with st.popover("⚙️ Model parameters"):
                model_temp = st.slider("Temperature", min_value=0.0, max_value=2.0, value=0.3, step=0.1)

            audio_response = False
            if openai_api_key:
                audio_response = st.toggle("Audio response", value=False)
            if audio_response:
                cols_audio = st.columns(2)
                with cols_audio[0]:
                    tts_voice = st.selectbox("Select a voice:", ["alloy", "echo", "fable", "onyx", "nova", "shimmer"])
                with cols_audio[1]:
                    tts_model = st.selectbox("Select a model:", ["tts-1", "tts-1-hd"], index=1)

            model_params = {
                "model": model,
                "temperature": model_temp,
            }

            def reset_conversation():
                if "messages" in st.session_state and len(st.session_state.messages) > 0:
                    st.session_state.pop("messages", None)

            st.button(
                "🗑️ Reset conversation",
                on_click=reset_conversation,
            )

            st.divider()

            # Image Upload
            if model in ["gpt-4o", "gpt-4-turbo", "gemini-1.5-flash", "gemini-1.5-pro"]:
                st.write(f"### **🖼️ Add an image{' or a video file' if model_type=='google' else ''}:**")

                def add_image_to_messages():
                    if st.session_state.uploaded_img or ("camera_img" in st.session_state and st.session_state.camera_img):
                        img_type = st.session_state.uploaded_img.type if st.session_state.uploaded_img else "image/jpeg"
                        if img_type == "video/mp4":
                            # save the video file
                            video_id = random.randint(100000, 999999)
                            with open(f"video_{video_id}.mp4", "wb") as f:
                                f.write(st.session_state.uploaded_img.read())
                            st.session_state.messages.append(
                                {
                                    "role": "user",
                                    "content": [{
                                        "type": "video_file",
                                        "video_file": f"video_{video_id}.mp4",
                                    }]
                                }
                            )
                        else:
                            raw_img = Image.open(st.session_state.uploaded_img or st.session_state.camera_img)
                            img = get_image_base64(raw_img)
                            st.session_state.messages.append(
                                {
                                    "role": "user",
                                    "content": [{
                                        "type": "image_url",
                                        "image_url": {"url": f"data:{img_type};base64,{img}"}
                                    }]
                                }
                            )

                cols_img = st.columns(2)

                with cols_img[0]:
                    with st.popover("📁 Upload"):
                        st.file_uploader(
                            f"Upload an image{' or a video' if model_type == 'google' else ''}:",
                            type=["png", "jpg", "jpeg"] + (["mp4"] if model_type == "google" else []),
                            accept_multiple_files=False,
                            key="uploaded_img",
                            on_change=add_image_to_messages,
                        )

                with cols_img[1]:
                    with st.popover("📸 Camera"):
                        activate_camera = st.checkbox("Activate camera (only images)")
                        if activate_camera:
                            st.camera_input(
                                "Take a picture",
                                key="camera_img",
                                on_change=add_image_to_messages,
                            )

            # Audio Upload
            st.write("#")
            st.write(f"### **🎤 Add an audio{' (Speech To Text)' if model_type == 'openai' else ''}:**")

            audio_prompt = None
            audio_file_added = False
            if "prev_speech_hash" not in st.session_state:
                st.session_state.prev_speech_hash = None

            speech_input = audio_recorder("Press to talk:", icon_size="3x", neutral_color="#6ca395", )
            if speech_input and st.session_state.prev_speech_hash != hash(speech_input):
                st.session_state.prev_speech_hash = hash(speech_input)
                if model_type == "openai":
                    transcript = client.audio.transcriptions.create(
                        model="whisper-1",
                        file=("audio.wav", speech_input),
                    )
                    audio_prompt = transcript.text

                elif model_type == "google":
                    # save the audio file
                    audio_id = random.randint(100000, 999999)
                    with open(f"audio_{audio_id}.wav", "wb") as f:
                        f.write(speech_input)

                    st.session_state.messages.append(
                        {
                            "role": "user",
                            "content": [{
                                "type": "audio_file",
                                "audio_file": f"audio_{audio_id}.wav",
                            }]
                        }
                    )
                    audio_file_added = True

        # Chat input
        if prompt := st.chat_input("Hi! Ask me anything...") or audio_prompt or audio_file_added:
            if not audio_file_added:
                st.session_state.messages.append(
                    {
                        "role": "user",
                        "content": [{
                            "type": "text",
                            "text": prompt or audio_prompt,
                        }]
                    }
                )

                # Display the new messages
                with st.chat_message("user"):
                    st.markdown(prompt)

            else:
                # Display the audio file
                with st.chat_message("user"):
                    st.audio(f"audio_{audio_id}.wav")

            with st.chat_message("assistant"):
                st.write_stream(
                    stream_llm_response(
                        model_params=model_params,
                        model_type=model_type,
                        api_key=openai_api_key if model_type == "openai" else google_api_key)
                )

            # --- Added Audio Response (optional) ---
            if audio_response:
                response = client.audio.speech.create(
                    model=tts_model,
                    voice=tts_voice,
                    input=st.session_state.messages[-1]["content"][0]["text"],
                )
                audio_base64 = base64.b64encode(response.content).decode('utf-8')
                audio_html = f"""
                <audio controls autoplay>
                    <source src="data:audio/wav;base64,{audio_base64}" type="audio/mp3">
                </audio>
                """
                st.html(audio_html)

if __name__=="__main__":
    main()
And we can save the app.py file to run it from the terminal and see if our evolved app works:
# activate the venv if needed
# make sure to install all requirements.txt dependencies
streamlit run app.py
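The message conversion done by messages_to_gemini can also be checked in isolation, without Streamlit or an API key. The following is a simplified, text-only sketch of the same idea (to_gemini_text_only is a hypothetical name, and it skips images, videos, and audio): assistant messages become "model" messages, and consecutive messages with the same role are merged into a single entry with several parts.

```python
def to_gemini_text_only(messages):
    """Convert OpenAI/Streamlit-style messages to the Gemini format, keeping text parts only."""
    gemini_messages = []
    for message in messages:
        role = "model" if message["role"] == "assistant" else "user"
        texts = [c["text"] for c in message["content"] if c["type"] == "text"]
        if gemini_messages and gemini_messages[-1]["role"] == role:
            # Same role as the previous message: merge into its parts
            gemini_messages[-1]["parts"].extend(texts)
        else:
            gemini_messages.append({"role": role, "parts": texts})
    return gemini_messages


# Example conversation in the format stored in st.session_state.messages
st_messages = [
    {"role": "user", "content": [{"type": "text", "text": "Hi!"}]},
    {"role": "user", "content": [{"type": "text", "text": "Are you there?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Yes, how can I help?"}]},
]
converted = to_gemini_text_only(st_messages)
for message in converted:
    print(message)
```

The two consecutive user messages end up in one "user" entry, which mirrors what the app does when an image or audio message is followed by a text prompt.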
In the last blog/video we created a GitHub repo and already deployed it to the Streamlit Community Cloud. Now, if we commit and push the changes to our repo, this will automatically trigger the update pipeline on Streamlit Cloud, so we will see the changes within a few seconds:
git add .
git commit -m "Add Gemini 1.5 models and adapt workflow for them"
git push
For some reason, apparently related to the Gemini API or to the communication between it and Streamlit Cloud, videos normally fail in the online app.
I hope you enjoyed this content and learnt from it how to create amazing online AI apps. If so, consider leaving some applause, a like, and subscribing! 🤗
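I have not pinned down the cause of these failures. One thing that may be worth ruling out, as an assumption rather than a confirmed explanation, is that uploaded videos are processed asynchronously and may not be usable right after genai.upload_file returns. The sketch below polls a file until it is ready. The name wait_until_active is hypothetical, and the code assumes the file object exposes state.name with values such as "PROCESSING" and "ACTIVE", and that genai.get_file(name) returns the refreshed file. The get_file function is passed in as a parameter so that the example runs with a fake instead of the real API.

```python
import time
from types import SimpleNamespace


def wait_until_active(file, get_file, poll_seconds=2.0, timeout_seconds=120.0, sleep=time.sleep):
    """Poll `get_file(file.name)` until the file leaves the PROCESSING state."""
    waited = 0.0
    while file.state.name == "PROCESSING":
        if waited >= timeout_seconds:
            raise TimeoutError(f"{file.name} is still processing after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
        file = get_file(file.name)
    if file.state.name != "ACTIVE":
        raise RuntimeError(f"{file.name} ended in state {file.state.name}")
    return file


def make_fake_get_file(states):
    """Fake replacement for the real lookup: returns the given states one by one."""
    remaining = iter(states)

    def get_file(name):
        return SimpleNamespace(name=name, state=SimpleNamespace(name=next(remaining)))

    return get_file


# "files/demo" is a made-up file name, used only for this demonstration
uploaded = SimpleNamespace(name="files/demo", state=SimpleNamespace(name="PROCESSING"))
ready = wait_until_active(uploaded, make_fake_get_file(["PROCESSING", "ACTIVE"]), sleep=lambda s: None)
print(ready.state.name)
```

In the app, this would be called as wait_until_active(genai.upload_file(path), genai.get_file) before adding the file to the message parts, if the assumption above holds.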
See you in the next one!! 🤖🚀