Starling LM 7B Alpha Surpasses Claude 2, Nears Parity with GPT-4 Turbo
I recently wrote about OpenChat 3.5, the first 7B model to achieve results comparable to ChatGPT!
Starling-7B is a fine-tuned version of OpenChat 3.5, released by researchers from Berkeley EECS. It’s been trained using Reinforcement Learning from AI Feedback (RLAIF) on the latest GPT-4 labeled ranking dataset, berkeley-nest/Nectar (183K chat prompts and 3.8M pairwise comparisons).
The team also leveraged a new reward training and policy tuning pipeline, which is where it sets itself apart from the rest.
Starling-7B was tested across various benchmarks, including MT-Bench, AlpacaEval, and MMLU, giving a comprehensive view of its capabilities:

MT-Bench and AlpacaEval assess a chatbot’s helpfulness. Starling-7B scores 8.09 on MT-Bench and 91.99 on AlpacaEval, outshining almost every other model to date except GPT-4 and its Turbo version.
Behind this performance boost is a reward model trained with a K-wise loss, which also helps address the scarcity of open-source reward models.
I think it’s important to be upfront about the challenges too. Starling-7B, like its predecessors, isn’t perfect. It still struggles in areas like reasoning and mathematics, and that’s where we will scratch the surface today.
In this article, I’ll walk you through:
- Local setup for Starling LM 7B alpha
- Running an initial test with a single-turn conversation
- Evaluating Starling LM 7B alpha for Fact Verification, Reasoning, and Creativity
- Multi-turn and coding conversations
- Resources in case you want to dive deeper
Let’s get building!

Getting Started with Starling LM 7B alpha
Let’s start by creating the project folder and virtual environment:
mkdir Starling-LM-7B-alpha && cd Starling-LM-7B-alpha
python3 -m venv Starling-LM-7B-alpha-env
source Starling-LM-7B-alpha-env/bin/activate
pip3 install torch transformers accelerate optimum
pip3 install ipykernel jupyter
# Optionally, fire up VSCode or your favorite IDE and let's get rolling!
code .
For the rest of the walk-through, you can create either a .py file or an .ipynb notebook. I will continue with a Jupyter notebook to run code in blocks and inspect the results interactively.
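Before downloading the weights, it can help to check roughly whether they will fit in memory. The sketch below is a back-of-the-envelope estimate only: it assumes a round 7 billion parameters and counts the weights alone, ignoring activations and the generation cache. `estimate_weight_memory_gb` is a hypothetical helper written for this article, not part of any library.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Assumption: ~7e9 parameters, a round figure for a "7B" model.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimate_weight_memory_gb(num_params: float, dtype: str = "float32") -> float:
    """Return the approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("float32", "float16", "int8"):
        print(f"{dtype}: ~{estimate_weight_memory_gb(7e9, dtype):.0f} GB")
```

If the float32 figure is more than your hardware offers, a lower-precision load may be worth considering.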
Running Initial Test with Starling LM 7B alpha
Start by importing the required libraries, setting the device, and loading the model and tokenizer:
import torch
import transformers

# Use the GPU when one is available; otherwise fall back to the CPU
if torch.cuda.is_available():
    torch.set_default_device("cuda")
else:
    torch.set_default_device("cpu")

# Download (on first run) and load the tokenizer and the model weights
tokenizer = transformers.AutoTokenizer.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")
model = transformers.AutoModelForCausalLM.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")
Next, add a utility function to generate responses:
def generate_response(prompt):
    # Tokenize the prompt into input IDs
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # max_length caps the total sequence length (prompt plus generated tokens)
    outputs = model.generate(
        input_ids,
        max_length=256,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    response_ids = outputs[0]
    # Decode the full sequence, which includes the prompt tokens
    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
    return response_text
Here’s how you can start a single-turn conversation:
prompt = "Hello, how are you?"
single_turn_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(single_turn_prompt)
print("Response:", response_text)
Output:
Response: GPT4 Correct User: How can you determine if a restaurant is
popular among locals or mainly attracts tourists, and why might this
information be useful? GPT4 Correct Assistant: To determine if a restaurant
is popular among locals or mainly attracts tourists, you can consider the
following factors:
1. Reviews and ratings: Check online review platforms like Yelp,
Google Reviews, or TripAdvisor to see if the restaurant has a high
number of reviews from locals or tourists. Look for reviews that mention
the restaurant's popularity among locals or tourists specifically.
2. Location: The restaurant's location can provide clues about its
popularity among locals. If it's situated in a residential area or near
a local landmark, it's more likely to be frequented by locals. On the
other hand, if it's near a popular tourist attraction or in a touristy
neighborhood, it might be more popular among tourists.
3. Menu offerings: A menu that features traditional local dishes or
ingredients is more likely to attract locals. If the menu is focused
on international cuisine or caters to a wide range of dietary preferences,
it might be more popular among tourists.
Good start!
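Notice that the printed response repeats the prompt before the answer, because `generate_response` decodes the whole sequence. Here is a minimal sketch for building the chat template and keeping only the assistant’s reply. `build_prompt` and `extract_reply` are hypothetical helper names introduced here, and the approach assumes the role marker text survives decoding unchanged, as it does in the output above.

```python
END_OF_TURN = "<|end_of_turn|>"


def build_prompt(user_message: str, role_prefix: str = "GPT4 Correct") -> str:
    """Wrap a single user message in the template used in this article."""
    return f"{role_prefix} User: {user_message}{END_OF_TURN}{role_prefix} Assistant:"


def extract_reply(decoded_text: str, role_prefix: str = "GPT4 Correct") -> str:
    """Return the text after the last assistant marker, or the whole text if absent."""
    marker = f"{role_prefix} Assistant:"
    _, found, reply = decoded_text.rpartition(marker)
    return reply.strip() if found else decoded_text.strip()


if __name__ == "__main__":
    # Stand-in for a decoded model output (special tokens already skipped)
    decoded = "GPT4 Correct User: Hello, how are you? GPT4 Correct Assistant: I'm doing well."
    print(build_prompt("Hello, how are you?"))
    print(extract_reply(decoded))
```

Searching for the last marker, rather than the first, keeps the helper usable when earlier assistant turns appear in the same text.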
Before moving on to other questions, I want to explain, in layman’s terms, the techniques the researchers employed:
- Nectar: This dataset comprises diverse chat prompts, responses from various models, and accurate ranking labels. The team built it from diverse sources, including lmsys-chat-1M, ShareGPT, Anthropic/hh-rlhf, UltraFeedback, Evol-Instruct, and Flan.
- Generating Quality Responses: The team used a variety of models, namely GPT-4, GPT-3.5-turbo, GPT-3.5-turbo-instruct, Llama-2-7B-chat, and Mistral-7B-Instruct, alongside other existing datasets and models.
- Reward Model Training: They trained a reward model using the K-wise maximum likelihood estimator under the Plackett-Luce Model.
- Policy Finetuning: The team experimented with several methods, including APA, PPO, and P3O, and eventually selected APA for its strong results.
- Importance of Quality in Datasets and Reward Models: They stress that the quality of the preference dataset and reward model is crucial, even more than the policy tuning method itself.
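To make the reward-model bullet more concrete: under the Plackett-Luce model, a ranking of K responses is treated as a sequence of choices, where each position picks the best remaining response with probability proportional to exp(reward). The sketch below computes the corresponding negative log-likelihood in plain Python. It illustrates the textbook Plackett-Luce objective, not the team’s actual training code, and the function name and example reward values are made up for illustration.

```python
import math


def plackett_luce_nll(rewards_in_ranked_order):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    rewards_in_ranked_order: reward-model scores for K responses, ordered
    from the response ranked best to the one ranked worst.
    """
    nll = 0.0
    for k in range(len(rewards_in_ranked_order)):
        remaining = rewards_in_ranked_order[k:]
        # Numerically stable log-sum-exp over the responses not yet picked
        m = max(remaining)
        log_norm = m + math.log(sum(math.exp(r - m) for r in remaining))
        # Log-probability that the k-th ranked response wins among the rest
        nll -= rewards_in_ranked_order[k] - log_norm
    return nll


if __name__ == "__main__":
    print(plackett_luce_nll([2.0, 1.0, 0.0]))  # scores agree with the ranking
    print(plackett_luce_nll([0.0, 1.0, 2.0]))  # scores contradict the ranking
```

With K = 2 this reduces to the familiar pairwise (Bradley-Terry) loss, -log sigmoid(r_best - r_worst). A lower value means the scores agree better with the given ranking.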
OK, let’s see how it responds to questions from different categories.
Evaluating Starling LM 7B alpha for Fact Verification, Reasoning, and Creativity
Let’s ask the following questions to see how Starling LM 7B alpha responds:
- Language Understanding and Creativity: “How would you explain the concept of democracy to a 10-year-old?”
- Problem-Solving and Logical Reasoning: “If a train travels at 60 miles per hour and has to cover a distance of 120 miles, how long will it take to reach its destination?”
- General Knowledge and Fact Verification: “Can you provide a summary of the French Revolution?”
Define the prompts:
prompts = [
    "How would you explain the concept of democracy to a 10-year-old?",
    "If a train travels at 60 miles per hour and has to cover a distance of 120 miles, how long will it take to reach its destination?",
    "Can you provide a summary of the French Revolution?",
]
Then run the generation using the following snippet:
for prompt in prompts:
    print(prompt)
    single_turn_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
    response_text = generate_response(single_turn_prompt)
    print("Response:", response_text)
Let’s have a look at the answers:
Question:
How would you explain the concept of democracy to a 10-year-old?
Answer:
Democracy is a way of making decisions where everyone gets a say.
Imagine you and your friends want to choose a game to play.
In a democracy, each of you would get a chance to share your opinion and
everyone's voice would be heard. Then, the group would decide together
which game to play, based on what everyone wants. This is similar to how
democracy works in a country, where people vote for the leaders they want
and the leaders make decisions that affect everyone.
In a democracy, people have the freedom to express their opinions and
beliefs, and everyone's opinion is valued. This helps to make sure that
decisions are fair and that everyone's needs are considered.
Here are some key points to remember about democracy:
1. Everyone gets a say: In a democracy, all people have the right to share
their opinions and have their voices heard. This is called "voting."
2. Fairness: Democracy is based on the idea that everyone's opinion is
important, so decisions are made in a way that treats everyone fairly.
Question:
If a train travels at 60 miles per hour and has to cover a distance of
120 miles, how long will it take to reach its destination?
Answer:
To find the time it takes for the train to reach its destination,
we can use the formula:
Time = Distance / Speed
In this case, the distance is 120 miles and the speed is 60 miles per hour.
Plugging these values into the formula, we get:
Time = 120 miles / 60 miles per hour = 2 hours
Therefore, the answer is 2 hours.
Question:
Can you provide a summary of the French Revolution?
Answer:
The French Revolution, which took place from 1789 to 1799, was a period of
radical social and political change in France. It began as a response to the
financial crisis and the absolute monarchy of King Louis XVI, and culminated
in the rise of Napoleon Bonaparte. The revolution led to the abolition of
the monarchy, the establishment of a republic, and the creation of the
Declaration of the Rights of Man and of the Citizen, which laid the
foundation for modern democratic principles. The revolution was marked
by a series of violent uprisings, including the Storming of the Bastille,
the Reign of Terror, and the rise of radical factions like the Jacobins.
The revolution ultimately led to the rise of Napoleon Bonaparte, who seized
power in a coup in 1799 and established himself as the First Consul, later
becoming Emperor of the French.
To explain democracy to a 10-year-old, it used a relatable analogy (choosing a game to play with friends) to simplify the concept.
For the train travel problem, it correctly applied the formula Time = Distance / Speed, calculated the time accurately, and showed the reasoning behind the calculation.
For the French Revolution, it provided a concise yet comprehensive summary. It touched on key events and figures, like the Storming of the Bastille, the Reign of Terror, and Napoleon Bonaparte, which shows a good grasp of historical facts.
If you want to compare these answers with those of other models, such as:
- Orca 2 (rivals GPT4 with 13B parameters)
- OpenHermes 2.5 Mistral 7B (beats Deepseek 67B and Qwen 72B on AGIEval)
- OpenHermes-2.5-neural-chat-7b-v3-1-7B (#1 in 7B and 13B category)
I recently asked them the same questions as well.
In addition to single-turn conversations, you can also have multi-turn and coding conversations. Here’s how:
Multi-turn conversation
## Multi-turn conversation
prompt = "Hello"
follow_up_question = "How are you today?"
# Assistant's reply to the first turn; left empty here as a placeholder
response = ""
multi_turn_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant: {response}<|end_of_turn|>GPT4 Correct User: {follow_up_question}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(multi_turn_prompt)
print("Multi-turn conversation response:", response_text)
Coding conversation
### Coding conversation
prompt = "Implement quicksort using C++"
coding_prompt = f"Code User: {prompt}<|end_of_turn|>Code Assistant:"
response = generate_response(coding_prompt)
print("Coding conversation response:", response)
Output:
Coding conversation response:
Code User: Implement quicksort using C++
Code Assistant: Here's an example of how you can implement quicksort in C++:
```cpp
#include <iostream>
using namespace std;
void quickSort(int arr[], int left, int right) {
    int i = left, j = right;
    int tmp;
    int pivot = arr[(left + right) / 2];
    /* partition */
    while (i <= j) {
        while (arr[i] < pivot)
            i++;
        while (arr[j] > pivot)
            j--;
        if (i <= j) {
            tmp = arr[i];
            arr[i] = arr[j];
            arr[j] = tmp;
            i++;
            j--;
        }
    };
    /* recursion */
    if (left < j)
        quickSort(arr, left, j);
    if (i < right)
        quickSort(arr, i, right);
}
```
If you want to find out more about Starling-7B, refer to the blog post and the Hugging Face model page.
Hope this walk-through was helpful.
Let me know what you think in the comments, and happy building!
