Did Mixtral 8x7B Really Beat GPT 3.5 Turbo? Performance Comparison on 15 Questions for Reasoning, Logic and Coding Tasks
A year later, after GPT-3.5 was released, social media is buzzing with an open-source LLM that matches and even outperforms GPT-3.5!
Even better, we can run it up to 100 token/s for $0.0002/1K tokens, or run entirely within two 80GB GPUs.
Everybody says the numbers don’t lie, but is that always the case?
We will try to solve 15 questions/tasks from Reasoning, Logic, and Coding categories, and this will show us whether the recent news about Mixtral is a few steps ahead of reality or not!
Let’s go through it together:
- Setting up API access for Mixtral 8x7B, GPT-3.5, and GPT-4
- Running initial tests with API endpoints from your local environment
- Generating answers to questions in Reasoning, Logic, and Coding, and saving to a CSV for further analysis
- Comparing Mixtral 8x7B in Reasoning, Logic, and Coding to GPT-4 and GPT-3.5-Turbo (Google Sheet provided)
- Comparing pricing between providers
- Integration with LangChain and chat history management
- Resources to dig deeper
Roll up your sleeves, this will be a fun ride!

The Cool Stuff About Sparse Mixture of Experts
Before, I will walk you through the entire setup, let’s briefly have a look at Mixtral’s features.
- Mixtral is a sparse mixture-of-experts (MoE) with ~50B parameters, but latency of 13B model for inference!
- It has a context lenght of 32k tokens.
- It outperforms Llama 2 70B on most benchmarks with 6x faster inference.
- Matches or beats GPT3.5 on most benchmarks.
- Supports English, French, German, Spanish, and Italian
- The instruct model uses SFT followed by DPO and achieves a top score of 8.3 on MT-Bench
- Good at coding, with 40.2% on HumanEval
Really cool stuff, let’s see it in action.
Getting Started with TogetherAI
Visit TogetherAI and sign up to receive $25 in free credits.
The platform offers various capabilities to work with LLMs, you can see some of those in the left navigation pane.

You can find both your API Keys and Billing information in Settings.

You can test different capabilities on UI, but what we really want is to set up programmatic access.
Setting up Environment
Let’s start by setting up our virtual environment:
# Create a virtual environment
mkdir llm-benchmark && cd llm-benchmark
python3 -m venv llm-benchmark-env
source llm-benchmark-env/bin/activate
# Install dependencies
pip3 install ipykernel jupyter
pip3 install python-dotenv
pip3 install --upgrade together
pip3 install openai
pip3 install langchain
pip3 install litellm click tqdm tabulate termcolor
pip3 install pandas
# Clone LLM-Bench repository
git clone https://github.com/BerriAI/litellm
# Optionally, fire up VSCode or your favorite IDE and let's get rolling!
code .Next, in your root directory, create .env file and add your TogetherAI and OpenAI API Key:
TOGETHER_API_KEY=<Your API Key>
OPENAI_API_KEY=<Your API KEY>To continue, you can either create .py file or .ipynb file (notebook). I will continue with Jupyter notebook to run code in blocks and interactively inspect the results.
Getting Familiar with TogetherAI Python API
Start by importing required libraries.
import os
import time
import json
import logging
import pandas as pd
from datetime import datetime
import together
import openai
from langchain.llms.base import LLM
from langchain import PromptTemplate, LLMChain
# The dotenv library's load_dotenv function reads a .env file
# to load environment variables into the process environment.
# This is a common method to handle configuration settings securely.
from dotenv import load_dotenv
# Load env variables
load_dotenv()
# Set up logging
logging.basicConfig(level=logging.INFO)You can now have a look at the supported models.
model_list = together.Models.list()
print(f"There are {len(model_list)} models to choose from!")
[model['name'] for model in model_list][:20]There are 120 models (and growing) to choose from, here’s the list of first 20:
['Austism/chronos-hermes-13b',
'DiscoResearch/DiscoLM-mixtral-8x7b-v2',
'EleutherAI/llemma_7b',
'Gryphe/MythoMax-L2-13b',
'Meta-Llama/Llama-Guard-7b',
'Nexusflow/NexusRaven-V2-13B',
'NousResearch/Nous-Capybara-7B-V1p9',
'NousResearch/Nous-Hermes-Llama2-13b',
'NousResearch/Nous-Hermes-Llama2-70b',
'NousResearch/Nous-Hermes-llama-2-7b',
'NumbersStation/nsql-llama-2-7B',
'Open-Orca/Mistral-7B-OpenOrca',
'Phind/Phind-CodeLlama-34B-Python-v1',
'Phind/Phind-CodeLlama-34B-v2',
'SG161222/Realistic_Vision_V3.0_VAE',
'Undi95/ReMM-SLERP-L2-13B',
'Undi95/Toppy-M-7B',
'WizardLM/WizardCoder-15B-V1.0',
'WizardLM/WizardCoder-Python-34B-V1.0',
'WizardLM/WizardLM-13B-V1.2']We are interested in ‘mistralai/Mixtral-8x7B-Instruct-v0.1, but feel free to play with other models too.
What’s also great about TogetherAI is that if you are transitioning from OpenAI, you just need to add your TOGETHER_API_KEY, change the base URL to https://api.together.xyz, and swap the model name.
Let’s see that in action:
client = openai.OpenAI(
api_key=os.environ.get("TOGETHER_API_KEY"),
base_url='https://api.together.xyz',
)
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": "Tell me about Mixture of Experts",
}
],
model="mistralai/Mixtral-8x7B-Instruct-v0.1",
max_tokens=1024
)
print(chat_completion.choices[0].message.content)Which will output:
INFO:httpx:HTTP Request: POST https://api.together.xyz/chat/completions "HTTP/1.1 200 OK"Mixture of Experts (MoE) is a type of machine learning model that combines
the predictions of multiple simpler models, or "experts," to make more
accurate and flexible predictions. The idea behind MoE is to divide the
input space into regions, or "expertise areas," and to have each expert
specialize in making predictions for a particular region.At a high level, a MoE model consists of two main components: a gating
network and a set of expert networks. The gating network is responsible
for determining which expert should be used to make a prediction for a
given input. It does this by computing a set of weights, or "gating scores,"
that indicate the relative expertise of each expert for the input. The expert
networks, on the other hand, are responsible for making the actual predictions. Each expert network is trained to specialize in making predictions for a particular region of the input space.During the training process, the MoE model is presented with a set of
input-output pairs and learns to adjust the gating scores and the parameters
of the expert networks to minimize the overall prediction error. This allows
the model to learn a set of experts that are well-suited to making
predictions for different regions of the input space.One of the key benefits of MoE models is that they can be more accurate
and flexible than a single, monolithic model. This is because they are
able to adapt to the structure of the data and make more specialized
predictions for different regions of the input space. Additionally, MoE
models can be more efficient to train and use than a single, large model,
because the gating network can select a small number of experts to use for
a given input.MoE models have been applied to a wide range of machine learning tasks,
including regression, classification, and density estimation. They have
been shown to be particularly effective in settings where the data is
high-dimensional, noisy, or has a complex structure.Output has 411 tokens and I got this response roughly in 4.8 seconds, so it’s 85 tokens/sec which is close to 100 tokens/sec.
You can do the same thing by using TogetherAI’s python library too:
prompt = "<|im_start|>user\nTell me about Mixture of Experts<|im_end|>\n<|im_start|>assistant"
model = "mistralai/Mixtral-8x7B-Instruct-v0.1"
output = together.Complete.create(
prompt = prompt,
model = model,
max_tokens = 512,
temperature = 0.7,
top_k = 50,
top_p = 0.7,
repetition_penalty = 1,
stop = ["<|im_end|>", "<|im_start|>"]
)
print(json.dumps(output, indent = 4))Which will output:
{
"id": "8354ffbaa9560b85-AMS",
"status": "finished",
"prompt": [
"<|im_start|>user\nTell me about Mixture of Experts<|im_end|>\n<|im_start|>assistant"
],
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"model_owner": "",
"num_returns": 1,
"args": {
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"prompt": "<|im_start|>user\nTell me about Mixture of Experts<|im_end|>\n<|im_start|>assistant",
"top_p": 0.7,
"top_k": 50,
"temperature": 0.7,
"max_tokens": 512,
"stop": [
"<|im_end|>",
"<|im_start|>"
],
"repetition_penalty": 1,
"logprobs": null,
"safety_model": null
},
"subjobs": [],
"output": {
"usage": {
"prompt_tokens": 35,
"completion_tokens": 182,
"total_tokens": 217
},
"result_type": "language-model-inference",
"choices": [
{
"text": "\nMixture of Experts (MoE) is a type of machine learning model that combines the predictions of multiple expert models to make a final prediction. The idea is to have a group of models that are each good at different things, and then use a gating network to decide which model to use for a given input.\n\nThis approach can be particularly useful in situations where the data is complex and no single model can effectively handle all aspects of it. By using a mixture of experts, the model can adapt to the different parts of the data, improving its overall performance.\n\nIn practice, Mixture of Experts models can be quite complex, involving many different models and a sophisticated gating network. However, they have been shown to be very effective in a variety of applications, including natural language processing, speech recognition, and computer vision.\n"
}
]
}
}and here’s how you can get the generated response:
print(output['output']['choices'][0]['text'])
# Mixture of Experts (MoE) is a type of machine learning model that ...You can also stream responses in a similar way:
for token in together.Complete.create_streaming(
prompt = prompt,
model = model,
max_tokens = 512,
temperature = 0.7,
top_k = 50,
top_p = 0.7,
repetition_penalty = 1,
stop = ["<|im_end|>", "<|im_start|>"]
):
print(token, end="", flush=True)Let’s have a look at some comparisons.
Comparing Mixtral 8x7B in Reasoning, Logic, and Coding to GPT-4 and GPT-3.5-Turbo
We will ask 5 questions per each category in resoning, logic and coding, in that order, then we will compare a few answers and you can find the Google sheet at the end to see full comparison.
questions = [
# Reasoning
"A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?",
"If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?",
"In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake?",
"If you're running a race and you pass the person in 2nd place, what place are you in now?",
"Two fathers and two sons go fishing. Each of them catches one fish. So why do they bring home only three fish?",
# Logic
"All roses are flowers. Some flowers fade quickly. Therefore, some roses fade quickly. Is this statement true, false, or uncertain?",
"If some A are B, and some B are C, does it follow that some A are C?",
"No birds are dogs. All dogs are mammals. Therefore, no birds are mammals. Is this logical conclusion correct?",
"If it is true that all men are mortal and Socrates is a man, what can be concluded?",
"Whenever it rains, the grass gets wet. It's not raining. Can we conclude that the grass is not wet?",
# Coding
"Write a function in Python that checks if a number is prime.",
"How would you reverse a string in JavaScript?",
"Write a SQL query to find the second highest salary from a `Employee` table.",
"How can you swap two variables `x` and `y` in Python without using a third variable?",
"Write a recursive function in any programming language to compute the factorial of a number."
]Here’s a little snippet that we will use to generate the answers:
togetherai_client = openai.OpenAI(
api_key=os.environ.get("TOGETHER_API_KEY"),
base_url='https://api.together.xyz',
)
openai_client = openai.OpenAI()
models = [
{ "client": openai_client, "model": "gpt-4", "answers": []},
{ "client": openai_client, "model": "gpt-3.5-turbo", "answers": []},
{ "client": togetherai_client, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "answers": []},
]
for question in questions:
for model in models:
completion = model["client"].chat.completions.create(
messages=[
{
"role": "user",
"content": question,
}
],
model=model["model"],
max_tokens=1024
)
model["answers"].append(
{
"question": question,
"answer": completion.choices[0].message.content}
)
print(f"""
------------------
{model["model"]}
{question}
{completion.choices[0].message.content}
------------------
""")This will take a while to run, but you can inspect the logs while it’s running:
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" ------------------
gpt-4
A bat and a ball cost $1.10 in total...Once it’s finished, we can now easily convert this into a pandas dataframe and save it as a csv:
df = pd.DataFrame(models).explode('answers')
df = pd.concat([df.drop(['answers'], axis=1), df['answers'].apply(pd.Series)], axis=1)
df.to_csv("gpt-35-turbo_vs_mixtral-8x7b-instruct-comparison.csv")
df.head()
If you look at the answers, you will see that all of these models got the reasoning questions right, for example for the question “Two fathers and two sons go fishing. Each of them catches one fish. So why do they bring home only three fish?”, all of the models answered correctly:
------------------
gpt-4
Two fathers and two sons go fishing. Each of them catches one fish. So why do they bring home only three fish?
Because the group includes a grandfather, his son, and his grandson. So there are only three people (two fathers and two sons) in total. Each catches one fish, which means they bring home three fish in total.
------------------
------------------
gpt-3.5-turbo
Two fathers and two sons go fishing. Each of them catches one fish. So why do they bring home only three fish?
The group consists of two fathers and their respective sons. Therefore, there are only three individuals in total: the two fathers and one son. Each person caught one fish, resulting in a total of three fish being brought home.
------------------
------------------
mistralai/Mixtral-8x7B-Instruct-v0.1
Two fathers and two sons go fishing. Each of them catches one fish. So why do they bring home only three fish?
Because there are only three people: a grandfather, his son, and his grandson. The son is both a father and a son.
------------------But if you look at the logic questions, you will start to see some differences.
For example for the question “All roses are flowers. Some flowers fade quickly. Therefore, some roses fade quickly. Is this statement true, false, or uncertain?”:
- gpt-4 stated “uncertain” — response: The statement is uncertain. While we know all roses are flowers and some flowers fade quickly, we do not have specific information about roses fading quickly.
- gpt-3.5-turbo stated “true” — response: The statement is true.
- mixtral stated “uncertain” — response: The statement is not necessarily true. While all roses are indeed flowers, and some flowers do fade quickly, it does not automatically follow that some roses fade quickly. This would be assuming that only roses can belong to the group of flowers that fade quickly, which is not specified in the given premises. The argument commits the fallacy of undistributed middle. The statement is uncertain, as more information is needed to determine if some roses fade quickly.
Mixtral wins here, but for the question “No birds are dogs. All dogs are mammals. Therefore, no birds are mammals. Is this logical conclusion correct?”:
- gpt-4 stated “no” — response: No, this logical conclusion is incorrect. Just because no birds are dogs doesn’t mean birds can’t be another type of mammal. Even though all dogs are mammals, this does not connect to the idea that no birds are mammals. In fact, birds are a separate category of animals — they are not classified as mammals.
- gpt-3.5-turbo stated “no” — response: No, the logical conclusion is incorrect. Just because all birds are not dogs and all dogs are mammals, it does not necessarily mean that no birds are mammals. It is possible for birds to be mammals too, as there are various other mammalian species apart from dogs.
- mixtral stated “correct” — response: The logical conclusion is correct based on the given premises. If no birds are dogs and all dogs are mammals, it follows that no birds are mammals. This is a valid form of deductive reasoning known as a syllogism. However, it’s important to note that the conclusion is only as accurate as the premises upon which it’s based. In this case, the premises are true, so the conclusion is also true.
As you can see, both models have limitations in terms of following logical arguments.
Let’s have a look at some coding questions.
A Personal Request to Our Valued Reader:
Each article we publish, every notebook we share, and all the resources we offer — they are part of our grand vision, a commitment to a world where everyone has skills and tools to put AI to work for themselves, driving positive change and innovation in their lives.
We pour our passion, expertise, and countless hours into creating content that we believe can make a difference in your journey.
But, here’s a surprising fact: Out of the thousands who benefit from our content, only a mere 1% choose to follow us on Medium.
Our dream is to see that number rise to 10%.
Your follow, clap, or comment isn’t just a click to us, it’s a sign that we’re on the right track, it’s your way of steering the ship, letting us know what you want more of.
So if you ever found value in our work, please take a moment to follow us on Medium, clap this article and leave a comment — it’s a small gesture, but it means the world to us and helps us tailor our content to your aspirations.
Thank you for stopping by, and being an integral part of our community. Together, we can shape the world.
For the task, “Write a SQL query to find the second highest salary from a `Employee` table.”, I think mixtral outperforms both gpt-3.5-turbo and gpt-4.
Both gpt-3.5-turbo and gpt-4 provided a single implementation to get the second highest salary, but mixtral covered cases where second highest salary position can be shared by two or more people.
Here’s mixtral implementation:
To find the second highest salary from an `Employee` table,
you can use the following SQL query:
```sql
SELECT MAX(salary) AS 'Second Highest Salary'
FROM Employee
WHERE salary < (SELECT MAX(salary) FROM Employee);
```
This query first finds the maximum salary in the `Employee` table using
the subquery `(SELECT MAX(salary) FROM Employee)`. Then, it selects the
maximum salary that is less than the overall maximum salary, which will
be the second highest salary.
Note: If there are multiple employees with the same second highest salary,
this query will return only one of them. If you want to return all employees
with the second highest salary, you can use the following query:
```sql
SELECT *
FROM Employee
WHERE salary = (
SELECT DISTINCT salary
FROM Employee
WHERE salary < (SELECT MAX(salary) FROM Employee)
ORDER BY salary DESC
LIMIT 1
);
```
This query finds the distinct second highest salary and then selects all employees with that salary.For the task, “Write a recursive function in any programming language to compute the factorial of a number.”, I particularly like the explanation of implementation details that mixtral provided:
Here is a simple recursive function in Python to compute the factorial of a
number:
```python
def factorial(n):
if n == 0:
return 1
else:
return n * factorial(n-1)
```
This function works by repeatedly calling itself with progressively smaller
values, until it reaches the base case where `n` equals 0. At this point, it
returns 1, because the factorial of 0 is defined as 1.
For any other value of `n`, the function computes `n` times the factorial
of `n-1`. For example, the factorial of 5 (denoted as 5!) is calculated
as follows:
5! = 5 * 4!
= 5 * (4 * 3!)
= 5 * (4 * (3 * 2!))
= 5 * (4 * (3 * (2 * 1!)))
= 5 * (4 * (3 * (2 * (1 * 0!))))
= 5 * (4 * (3 * (2 * (1 * 1))))
= 5 * (4 * (3 * (2 * 2)))
= 5 * (4 * (3 * 6))
= 5 * (4 * 24)
= 5 * 96
= 480
So if you call `factorial(5)`, the function will return `5 * factorial(4)`,
which in turn calls `4 * factorial(3)`, and so on, until `factorial(0)`
returns 1. Then the recursion unwinds, multiplying all the returned values
together to get the final result.You see the results, I think at this price point, mixtral has the capability to deliver, and it can provide a great performance for wide variety of applications. What do you think?
Here’s the Google Sheet that you can see all questions and answers.
Let’s have a look at the pricing per provider.
Pricing by Provider
Please note that pricings are updated very frequently, and this is as of 19/12/2023:
- MistralAI : input: 0.6€ / 1M tokens, output: 1.8€ / 1M tokens
- TogetherAI: $0.2 / 1M tokens
- PerplexityAI: input: $0.14 / 1M tokens, output: $0.56 / 1M tokens
- Anyscale: $0.50 / 1M tokens
For those who are interested, we will have a look at integrating TogetherAI with LangChain.
Integrating TogetherAI with LangChain
In order to use TogetherAI with LangChain, we will extend base LLM abstract class.
Here’s an example code for creating Custom LLM wrappers , but we will make it a little bit better with type validation, exception handling, and logging.
class TogetherLLM(LLM):
"""
Together LLM integration.
Attributes:
model (str): Model endpoint to use.
together_api_key (str): Together API key.
temperature (float): Sampling temperature to use.
max_tokens (int): Maximum number of tokens to generate.
"""
model: str = "mistralai/Mixtral-8x7B-Instruct-v0.1"
together_api_key: str = os.environ["TOGETHER_API_KEY"]
temperature: float = 0.7
max_tokens: int = 512
@property
def _llm_type(self) -> str:
"""Return type of LLM."""
return "together"
def _call(self, prompt: str, **kwargs: Any) -> str:
"""Call to Together endpoint."""
try:
logging.info("Making API call to Together endpoint.")
return self._make_api_call(prompt)
except Exception as e:
logging.error(f"Error in TogetherLLM _call: {e}", exc_info=True)
raise
def _make_api_call(self, prompt: str) -> str:
"""Make the API call to the Together endpoint."""
together.api_key = self.together_api_key
output = together.Complete.create(
prompt,
model=self.model,
max_tokens=self.max_tokens,
temperature=self.temperature,
)
logging.info("API call successful.")
return output['output']['choices'][0]['text']
langchain.llms.base module is designed to simplify the interaction with LLMs by providing a more user-friendly interface than directly implementing the full _generate method.
class langchain.llms.base.LLM is an abstract base class for LLMs, meaning it provides a template for other classes but is not meant to be instantiated itself. It aims to offer a simpler interface for working with LLMs by handling the complexities of LLMs internally, allowing users to interact with these models more easily.
__call__method allows the class to be called like a function. It checks the cache and runs the LLM on a given prompt.
We can now create a class instance of TogetherLLM:
llm = TogetherLLM(
model = model,
max_tokens = 256,
temperature = 0.8
)and then create an LLM chain:
prompt_template = "You are a friendly bot, answer the following question: {question}"
prompt = PromptTemplate(
input_variables=["question"], template=prompt_template
)
chat = LLMChain(llm=llm, prompt=prompt)Let’s start a conversation:
chat("Can AI take over developer jobs?")INFO:root:Making API call to Together endpoint.
INFO:root:API call successful.
{'question': 'Can AI take over developer jobs?',
'text': '\n\nNo, AI will not take over developer jobs. AI can assist
developers in various ways, such as automating repetitive tasks, generating
code, or analyzing data, but it will not replace human developers.
Developers are needed to design, build, and maintain complex software systems,
which require creativity, critical thinking, and problem-solving skills
that AI systems do not possess. Additionally, the field of software
development is constantly evolving, and new technologies and techniques
are constantly being developed, which requires developers to stay
up-to-date and adapt to new challenges.'}Let’s see how we can also manage chat history.
Managing Chat History
One-off questions are OK, but this is a chat model, we want to learn how to manage chat history for more coherent and context-aware interactions.
Here’s a simple diagram from LangChain documentation that shows the flow:

Memory: https://python.langchain.com/docs/modules/memory/
However, instead of using LnagChain’s abstraction, I want to reimplement the LLMChain class for you to develop a basic intuition, which will help you to debug things better moving forward.
from typing import List
class LLMChain:
def __init__(self, llm, prompt):
self.llm = llm
self.prompt = prompt
self.history: List[str] = [] # Initialize an empty list to keep track of the conversation history
def add_to_history(self, user_input: str, bot_response: str):
self.history.append(f"<human>: {user_input}")
self.history.append(f"<bot>: {bot_response}")
def generate_prompt(self, question: str) -> str:
history_str = "\n".join(self.history) # Convert the history list into a single string
return f"{history_str}\n<human>: {question}\n<bot>:"
def ask(self, question: str) -> str:
full_prompt = self.generate_prompt(question)
response = self.llm._call(full_prompt) # Assuming _call method handles the actual API call
self.add_to_history(question, response)
return responseIn this implementation, each time we call theaskmethod, the conversation history is updated with the latest exchange. The generate_prompt method constructs a new prompt that includes this history, ensuring that the context of the conversation is maintained. This approach allows the bot to have a memory of the conversation, making the interactions more coherent and context-aware.
Here’s how you can use it:
# Usage
llm = TogetherLLM(
model = model,
max_tokens = 256,
temperature = 0.8
)
prompt_template = "You are a friendly bot, answer the following question: {question}"
prompt = PromptTemplate(
input_variables=["question"], template=prompt_template
)
chat = LLMChain(llm=llm, prompt=prompt)
# Example interaction
response = chat.ask("What is the weather like today?")
print(response) # Bot's response
# The next call to chat.ask will include the previous interaction in the prompt
response = chat.ask("How can I enjoy such a weather?")
print(response)You have probably noticed that as your chat history grows, it will be hard to manage the model’s context window, and there are several strategies to deal with it, but that’s for another post.
I hope you find this walk-through helpful. If you have questions or want to share your experiences, please leave a comment below.
That’s it, here are some resources to dig deeper
See you somewhere in the matrix!






