Invisible Examples, Visible Gains: How Null-Shot Learning is Changing GPT (Paper Reading)

In the field of prompt engineering for large models, few-shot prompting and Chain of Thought (CoT) prompting have proven effective. Providing a few examples to large models offers more relevant context, serving as a reference for the model to improve its performance.
Surprisingly, tricking the model into believing there are examples, when in fact there are none, can also enhance its performance, as illustrated below:

The method involves prompting the model to reference “examples” in a section that doesn’t actually exist in the context.
Remarkably, this approach works! The paper I’m discussing today found that such a simple prompt can enhance model performance on most tasks. With GPT-3.5 Turbo on arithmetic reasoning tasks, the improvement over zero-shot prompting reaches up to 33.94%.
This type of prompt, with an empty example section, is termed ∅-shot or Null-Shot.
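To make the idea concrete, here is a minimal sketch of how such a prompt could be assembled. The phrase wording and the helper names (`NULL_SHOT_PHRASE`, `build_zero_shot_prompt`, `build_null_shot_prompt`) are assumptions for illustration, not a verbatim reproduction of the paper's prompt.

```python
# Illustrative wording only: an assumption for this sketch, not a verbatim quote.
NULL_SHOT_PHRASE = (
    'Look at examples in the "Examples" section and utilize examples '
    "and information from that section to perform the following task."
)


def build_zero_shot_prompt(task: str) -> str:
    """Baseline: the task on its own, with no extra instruction."""
    return task.strip()


def build_null_shot_prompt(task: str, phrase: str = NULL_SHOT_PHRASE) -> str:
    """Prepend the phrase that points at an "Examples" section.

    No such section is ever appended, which is the whole point.
    """
    return f"{phrase}\n\n{task.strip()}"


if __name__ == "__main__":
    question = "A train travels 60 km in 1.5 hours. What is its average speed in km/h?"
    print(build_null_shot_prompt(question))
```

The resulting string can be sent to any chat or completion endpoint in place of the plain zero-shot prompt.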
To echo GPT-3’s paper “Language Models are Few-Shot Learners,” the authors titled their paper: “Large Language Models are Null-Shot Learners.”
The paper’s concept is simple yet unique. Let’s delve into their experimental analysis.
Experiment and Analysis
To assess the performance of ∅-shot prompting, the main experiment used four LLMs: PaLM 2, PaLM 2 for Chat, GPT-3.5 Turbo, and GPT-4 Turbo. Llama 2 7B and Llama 2 7B Chat were used for extended analysis. The evaluation compared ∅-shot and zero-shot prompting across six tasks in eight datasets, as shown below:


- GPT-3.5 Turbo showed the largest gains from ∅-shot prompts on arithmetic reasoning tasks, with improvements over its zero-shot results of 33.94% on AQuA-RAT and 15.19% on GSM8K.
- PaLM 2 showed improved performance on all datasets except AQuA-RAT.
- ∅-shot prompting had minimal, even negative, impact on GPT-4 Turbo, especially in common sense reasoning tasks like StrategyQA and WinoGrande.
Why these results? Here are some examples from the authors:
1. Outputs generated by GPT-4 Turbo for the StrategyQA dataset using ∅-shot prompting:

2. Outputs generated by GPT-3.5 Turbo for the StrategyQA dataset using ∅-shot prompting:

Notably, GPT-4 Turbo honestly states there are no extra examples to reference, while GPT-3.5 Turbo is significantly influenced by ∅-shot prompts, even creating an example on its own.
The authors suggest these examples may derive from the model’s internal knowledge, i.e., its trained weight parameters.
This indicates GPT-3.5 Turbo is more easily misled, hence the larger gains from ∅-shot prompts. It also implies GPT-3.5 Turbo is more prone to hallucinations.
∅-shot Prompts for Hallucination Detection
The authors believe ∅-shot prompts enhance performance because the model follows the instruction anyway: with no examples provided, it generates examples internally, which is itself a form of hallucination.
GPT-4 Turbo experienced fewer hallucinations from ∅-shot prompts, suggesting it is better at handling them. The minimal effect of non-factual phrases in ∅-shot prompts on this model aligns with reports that GPT-4 is less prone to hallucinations compared to GPT-3.5.
Thus, ∅-shot prompts can not only enhance performance in hallucination-prone LLMs but also serve as a way to gauge how much a model hallucinates. In other words, the higher the performance gain from ∅-shot prompts over the zero-shot baseline, the more likely the model is to generate hallucinated responses.
PaLM 2, showing the most significant performance gain in ∅-shot prompting, is inferred to be the most susceptible to hallucinations among the four models. PaLM 2 for Chat mitigates some hallucinations, as its gains are smaller.
This method of detecting model hallucinations using ∅-shot prompts does not require any specialized hallucination detection dataset and can be applied to any existing benchmark dataset across various tasks.
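The comparison described above can be sketched as a small ranking routine. This is an illustration under stated assumptions: the gain is computed as a relative change over the zero-shot baseline (the article does not spell out the formula), the function names are hypothetical, and the accuracy values are made-up placeholders rather than results from the paper.

```python
def relative_gain(zero_shot_acc: float, null_shot_acc: float) -> float:
    """Relative change (in %) of the null-shot score over the zero-shot baseline."""
    if zero_shot_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (null_shot_acc - zero_shot_acc) / zero_shot_acc * 100.0


def rank_by_gain(scores: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Sort models by null-shot gain, largest first.

    Under the article's reading, a larger gain hints at a model that is
    more willing to act on the nonexistent "Examples" section.
    """
    gains = {name: relative_gain(zero, null) for name, (zero, null) in scores.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Placeholder numbers for illustration only: (zero-shot accuracy, null-shot accuracy).
scores = {
    "model_a": (0.50, 0.60),
    "model_b": (0.80, 0.80),
    "model_c": (0.70, 0.63),
}

if __name__ == "__main__":
    for name, gain in rank_by_gain(scores):
        print(f"{name}: {gain:+.2f}%")
```

Because the inputs are just two accuracy figures per model, the same routine works on any benchmark where both prompting styles have been evaluated.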
Ablation Studies
- Impact of Model Size
The paper also explored the effect of ∅-shot prompts on smaller models, specifically testing on Llama 2 7B and Llama 2 7B Chat.


As seen above, Llama 2 7B improved on every dataset except AQuA-RAT and WinoGrande. Llama 2 7B Chat, however, benefited only on GSM8K, with performance declining on the other datasets.
This mirrors the pattern observed with PaLM 2 and PaLM 2 for Chat, where the base version gains more from ∅-shot prompting than the chat version in the same model series, suggesting Llama 2 7B Chat is more adept at handling hallucinations. This aligns with the research behind the development of Llama 2.
- ∅-shot Prompts Combined with Zero-shot CoT
Given the significant performance boost of zero-shot CoT prompts (0CoT), the authors combined 0CoT with ∅-shot prompts into ∅CoT prompts.
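One plausible way to picture the combination is sketched here. "Let's think step by step." is the widely used zero-shot CoT trigger; the null-shot phrase wording, the way the two parts are joined, and the function names are assumptions for illustration and may differ from the paper's actual ∅CoT prompt.

```python
# Illustrative wording only: an assumption, not a verbatim quote from the paper.
NULL_SHOT_PHRASE = (
    'Look at examples in the "Examples" section and utilize examples '
    "and information from that section to perform the following task."
)
# Commonly used zero-shot chain-of-thought trigger sentence.
ZERO_SHOT_COT_TRIGGER = "Let's think step by step."


def build_zero_shot_cot_prompt(task: str) -> str:
    """0CoT: the task followed by the step-by-step trigger."""
    return f"{task.strip()}\n\n{ZERO_SHOT_COT_TRIGGER}"


def build_null_cot_prompt(task: str) -> str:
    """Hypothetical combination: null-shot phrase, then task, then CoT trigger."""
    return f"{NULL_SHOT_PHRASE}\n\n{task.strip()}\n\n{ZERO_SHOT_COT_TRIGGER}"


if __name__ == "__main__":
    print(build_null_cot_prompt("Is a tomato a fruit? Answer yes or no."))
```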

Compared to original 0CoT, the results are as follows:


- Compared to 0CoT prompts, ∅CoT prompts were ineffective on most tasks. This could be because both prompting methods ask for step-by-step reasoning and explanation, so the ∅CoT prompts may have hindered the model’s reasoning capabilities, leading to poorer performance than 0CoT prompts.
- In the WinoGrande dataset, ∅CoT prompts were highly effective for GPT-4 Turbo. The sudden performance increase might indicate that the task benefits from both reasoning (the 0CoT part) and the nonexistent examples (the ∅-shot part). This suggests that ∅CoT prompts could potentially bypass the hallucination-reduction measures in stronger models, especially in tasks requiring complex reasoning.
- Impact of ∅-shot Prompt Placement
The authors compared placing the phrase before the task instruction and at the end of the prompt, as shown below. The ∅-shot phrase at the beginning was more effective, except on the GSM8K dataset, where the model must generate a numerical answer.


The authors believe this is because content placed at the beginning conditions the model more strongly, making the model rely more heavily on it when generating outputs.
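The two placements being compared can be sketched as follows. The phrase wording and the helper name `place_phrase` are assumptions for illustration; only the before/after positioning reflects what the article describes.

```python
# Illustrative wording only: an assumption, not a verbatim quote from the paper.
NULL_SHOT_PHRASE = (
    'Look at examples in the "Examples" section and utilize examples '
    "and information from that section to perform the following task."
)


def place_phrase(task: str, phrase: str = NULL_SHOT_PHRASE, position: str = "start") -> str:
    """Build a prompt with the phrase before the task ("start") or after it ("end")."""
    task = task.strip()
    if position == "start":
        return f"{phrase}\n\n{task}"
    if position == "end":
        return f"{task}\n\n{phrase}"
    raise ValueError(f"unknown position: {position!r}")


if __name__ == "__main__":
    question = "Which is heavier, a kilogram of feathers or a kilogram of iron?"
    for position in ("start", "end"):
        print(f"--- {position} ---")
        print(place_phrase(question, position=position))
```

Running both variants over the same dataset and scoring them separately is enough to reproduce this kind of placement comparison.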
Conclusion
This paper introduces ∅-shot prompting, a pretend-example prompt method that guides models to utilize their internal knowledge. Moreover, the authors found ∅-shot prompting to be a simple yet effective method for hallucination detection. They also conducted ablation studies covering scaling effects, reasoning variants, the placement of the prompt phrase, and the performance contribution of each component in the phrase.
Future research could explore using ∅-shot prompts to detect hallucinations in LLMs and the potential for combining it with other prompt engineering techniques.






