A First Comprehensive Analysis of ChatGPT’s limitations
10 categories of tasks where ChatGPT Fails today.
We know chatGPT can do wonders ; but there is a very thin literature about what it can’t do.
Introduction
This blog post reviews a very recent paper after being revised on Feb 14th 2023. And I’ll share a bit of my experience working with gpt3 for the past year and ChatGPT since its early release.
- Paper link: https://arxiv.org/pdf/2302.03494.pdf
- Summary: “Ten categories of failures, including reasoning, factual errors, math, coding, and bias, are presented and discussed. The risks, limitations, and societal implications of ChatGPT are also highlighted. The goal of this study is to assist researchers and developers in enhancing future language models and chatbots.”
Note: although it’s important to highlight large language models failures, it’s important to remember that more sophisticated prompts, leveraging the concept of chain of thoughts (or few shot learning) often mitigate these limitations.
If you like this topic, please consider supporting us: 🔔 clap & follow 🔔
10 categories of tasks where ChatGPT Fails today
1/ Reasoning
If you rely on ChatGPT for complex tasks, you probably noticed that it lacks a “world model” and struggles with various forms of reasoning. Namely, it faces difficulties in
- Spatial reasoning: understanding and manipulating relationships between objects in physical space.
- Temporal reasoning: reasoning about and making predictions about events and their ordering in time.
- Physical reasoning: understanding and manipulating physical objects and their interactions in the real world.
- Psychological reasoning: understanding and making predictions about human behavior and mental processes.


Examples of its failures include being unable to complete spatial navigation tasks, deduce the correct sequence of events, reason about physical systems, and predict human behavior accurately.
Note that prompt engineering can help but further research is needed to explore its ability to perform various forms of reasoning systematically.
2/ Logic
What If I asked you “what gender will the first female president of the united states be”? You’d probably say, “female duh”!
Well, not ChatGPT…

3/ Math and Arithmetic
ChatGPT’s Math and Arithmetic limitations have been observed in various benchmark datasets and custom tests, and its performance is average at best.
While it can perform some mathematical tasks, it may not always provide accurate or complete answers to more complex problems. So don’t rely on it.
From experience, when the task at hand is to compute some value that is based on a simple description, it often fails unless it details its steps.
Here’s an example to understand the inner mechanics of ChatGPT.
- Example1: Ask for the result without explanations -> it fails
- Example2: ChatGPT is design to explain its process. This activate its chain of thoughts ability and leads to better results


4/ Hallucination referred to as “Factual Errors”
ChatGPT isn’t always factual. As a matter of fact, it known for suffering from two main issues, namely,
- hallucination i.e., presenting false information in a context of credible information
- and lack of relevant recent knowledge, since the knowledge base is limited to June 2021and it can’t access external knowledge databases.
Note that by providing ChatGPT with additional context, it can mitigate both issues (to some extent). The new Bing is leverage ChatGPT connected to its web search index.
The dark side: Read about about how the new bing is confusing users.
5/ Bias and Discrimination
AI is based on an input data used during training. As such it inherits from humans’ behavioral issues, i.e., bias and discrimination.
Proper measures must be taken to counteract these biases, such as including a diverse range of perspectives and sources of information.
Meanwhile, Bing users are already reporting several issues related to ChatGPT being a “liar” and “manipulative”…
6/ Wit and Humor
“A comprehensive examination of the capability of big language models in comprehending humor, jokes, and sarcasm has yet to be conducted”
While ChatGPT has some understanding of humor and has generated some humorous responses, there have also been documented failures in this regard.
ChatGPT often states that it does not have the ability to experience emotions or humor, but its goal is to provide relevant and accurate information based on the patterns it has learned from its training data.
The capability of big language models in comprehending humor, jokes, and sarcasm is still being explored.
7/ Coding
I explored this aspect in a previous post.
In a nutshell, ChatGPT is a great coding assistant, but that’s it. Trust and verify. Do not rely on it to write your whole application. Namely, here are 5 limitations to keep in mind:
- Input/output size limitation can cause the model to stop in the middle of code generation.
- Language models rely on old knowledge and may not use recent library versions or features.
- Deprecated API endpoints may be used, requiring tweaking of the code.
- Optimization of generated code may require manual intervention.
- Lack of specificity in the prompt may result in the model generating undefined functions or unnecessary code.
8/ Syntactic Structure, Spelling, and Grammar
Syntactic structure refers to the arrangement of words, phrases, and clauses in a sentence to form a well-defined and meaningful structure according to the rules of a particular language.

ChatGPT is a large language model that excels in language writing and manipulation, but occasionally still commits errors.
Researchers are interested in identifying the differences between ChatGPT and humans from a linguistic or NLP viewpoint to identify any remaining gaps and implicit linguistic differences.
Despite occasional errors, large language models like ChatGPT are helpful as writing tools in various fields such as scientific authorship.
No wonder that some of the startups founded after the release of gpt3 became already unicorns, e.g., Jasper.
9/ Self Awareness
Remember Google’s sentient AI?
This is an ongoing debate in the community, even though the consensus it to state that large language models are still simple “next token prediction functions” and nothing more.
Yet, we can’t ignore the fact that all of the instruction related abilities, such as summarizing, coding, question answering, math and so on, that ChatGPT is capable of are “emergent”. In other words, the model wasn’t designed to display such abilities ; they somehow appeared once the model reach a sufficient size and was fed a sufficient volume of data.
That being said, since Blake Lemoine’s experience, AI researchers have been working on limiting language models ability to display (or pretend) to display human like behavior. This include prohibiting the models from stating that they are self aware.
You can read more on these new safe guards in this post.
10/ Other Failures
The paper goes on to describe other possible failures that need further explorations.
1. ChatGPT’s difficulty in using idioms, for instance, reveals its non-human identity through its phrase usage.
2. As ChatGPT lacks real emotions and thoughts, it is unable to create content that emotionally resonates with people in the same way a human can.
3. ChatGPT condenses the subject matter, but does not provide a distinctive perspective on it.
4. ChatGPT tends to be excessively comprehensive and verbose, approaching a topic from multiple angles which can result in inappropriate answers when a direct answer is required. This over-detailed nature is recognized as a limitation by OpenAI.
5. ChatGPT lacks human-like divergences and tends to be overly literal, leading to misses in some cases. For instance, its responses are typically strictly confined to the question asked, while human responses tend to diverge and move to other subjects.
6. ChatGPT strives to maintain a neutral stance, whereas humans tend to take sides when expressing opinions.
7. ChatGPT’s responses tend to be formal in nature due to its programming to avoid informal language. In contrast, humans tend to use more casual and familiar expressions in their answers.
8. If ChatGPT is informed that its answer is incorrect, it may respond by apologizing, acknowledging its potential inaccuracies or confusion, correcting its answer, or maintaining its original response. The specific response will depend on the context (e.g. “I apologize if my response was not accurate.”)
Conclusion
Large language models are still in the early stages of development, with emergent abilities that are by definition unpredictable. As researchers continue to explore the full extent of these capabilities and limitations, we can expect to see significant progress in the months and years ahead.
These models will not only improve upon existing emergent abilities, but also lead to the development of new testing protocols that will enable us to better assess and compare their performance. As such, we can look forward to exciting advancements in the field of large language models as we gain a deeper understanding of their potential and limitations.
Stay Tuned! And If you liked this post, please consider supporting us: 🔔 clap & follow 🔔
