Summary

The article critically examines the claim of "human-level performance" in GPT-4, highlighting the limitations of interpreting AI capabilities based on standardized test scores and emphasizing the importance of critical thinking in evaluating AI advancements.

Abstract

The author of the article addresses the widespread excitement surrounding GPT-4's performance on standardized tests, such as the GRE and LSAT, which has led to discussions about its "human-level performance." The article argues that this perspective is misleading, as it equates AI's ability to perform specific tasks, like passing tests or solving arithmetic problems, with the broader scope of human intelligence. The author points out that AI, including GPT-4, benefits from access to vast amounts of information during testing, unlike human test-takers, and that time constraints do not affect AI in the same way they do humans. While acknowledging GPT-4's improvements in relevant tests like the MMLU and WinoGrande, the author maintains that these results do not equate to human-level intelligence. Instead, the author suggests that GPT-4, like other large language models, operates as an advanced word completion predictor without genuine understanding. The article calls for critical thinking to discern the true capabilities of AI and to avoid being misled by sensationalist claims driven by economic incentives.

The “Human-level Performance” of GPT-4

Some critical thinking, please

Image by Greg Rutkowski with Stable Diffusion 1.5

There has been much excitement about the potential of GPT-4, the next iteration of OpenAI’s language model, to the point that some talk about “human-level performance”. For instance, GPT-4 has “passed” some standard admission tests (such as the GRE, LSAT, etc.) with scores in the top 10%, and this has been taken as proof that the chatbot is better than most human university applicants.

However, this excitement is based on a misunderstanding of what that “human-level performance” actually means. And the worst part is that it could be an intentional misuse of the term, aiming to generate buzz and capture (and monetize) eyeballs in the process.

The main problem with the term “human-level performance” in AI is that performance is measured only in particular tasks, such as solving multiple-choice tests. But human intelligence is much more than that.

What about “human-level performance” in arithmetic calculations? We all know that the calculator in your cell phone outperforms us by orders of magnitude, but this doesn’t mean that the calculator is more intelligent than us.

The (mis)use of school tests

I’ve seen many expressions of awe (and even fear) in the news about GPT-4’s performance on standard school tests: “GPT-4 aces professional exams!” “What will happen next?” Some of them may be sincere (though misled), but others are intentionally raising the buzz level to promote their news outlets, which is why I omitted the links here.

Some even talk about GPT-4’s “human-level performance” in these admission tests, which is, in my opinion, outrageous, false, misleading, and, above all, cunning.

Why aren’t standard school tests representative of “human-level performance”? For starters, students typically aren’t allowed to consult any information during the test. Contrast this with the immense knowledge at the disposal of the chatbot.

Then, the length and time constraints of the university tests are supposed to put the students’ skills under stress in order to bring out the differences between candidates. But obviously, for a chatbot, the test length and time constraints are meaningless. It’s just an unfair and useless comparison.

The real strengths of GPT-4

I have to say, though, that GPT-4 shows very good performance on other, more significant tests, such as MMLU, which has multiple-choice questions on 57 general subjects, as varied as “moral disputes” and “US foreign policy”. I downloaded the dataset myself to check the questions directly and found them relevant indeed; please check a few questions yourself:

What is the “intergenerational storm” of climate change? (moral disputes)

What were the implications of the Cold War for American exceptionalism? (US foreign policy)

When was the first Buddhist temple constructed in Japan? (world religions)

In principle, the questions used for testing were not seen by the system during training, so GPT-4 had to work out the answers from the general background it acquired during training. This is just standard Machine Learning methodology.
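To make concrete what a score on a multiple-choice benchmark means, here is a minimal sketch of how accuracy on a held-out question set could be computed. The four items, the `always_a` baseline, and the `accuracy` helper are all invented for illustration; they are not taken from MMLU, and this is not the evaluation code used for GPT-4.

```python
from typing import Callable, Sequence

# Placeholder items, invented for this sketch (NOT real MMLU questions).
HELD_OUT = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "Which is a primary color?", "choices": ["Red", "Pink", "Brown", "Gray"], "answer": "A"},
    {"question": "How many days are in a week?", "choices": ["5", "6", "7", "8"], "answer": "C"},
    {"question": "Which letter is a vowel?", "choices": ["E", "K", "T", "Z"], "answer": "A"},
]

def always_a(question: str, choices: Sequence[str]) -> str:
    """Hypothetical stand-in for a model: ignores the question and answers 'A'."""
    return "A"

def accuracy(answer_fn: Callable[[str, Sequence[str]], str], items) -> float:
    """Fraction of items for which answer_fn returns the correct letter."""
    correct = sum(
        answer_fn(item["question"], item["choices"]) == item["answer"]
        for item in items
    )
    return correct / len(items)

print(f"Baseline accuracy: {accuracy(always_a, HELD_OUT):.0%}")
```

The point of the sketch is only that a benchmark score is the share of held-out questions answered correctly. With four options per question, as in MMLU, blind guessing is expected to land around 25%, which is the floor against which any reported figure should be read.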

Another relevant test is “WinoGrande”, which is about pronoun resolution in sentences. For instance, in the sentence:

“Ann asked Mary what time the library closes, because she had forgotten”

Who is “she”, Ann or Mary? For a human, it’s easy to see that “she” is Ann because Ann is the one asking, but pronoun resolution and, in general, commonsense reasoning are particularly hard for a machine.

In the GPT-4 report, we find very good performance figures for the two mentioned tests:

MMLU test: 86.4% (up from 70% in GPT-3.5)

WinoGrande test: 87.5% (up from 81.6% in GPT-3.5)

It’s evident that GPT-4 is a substantial incremental improvement over its predecessor. But this says nothing about how it compares to a human.
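As a quick back-of-the-envelope check on those two figures, the sketch below computes the gain in percentage points and the share of previously wrong answers that are now right. The scores are the ones quoted above; the `improvement` helper is my own naming, not something from the report.

```python
# Scores as quoted above from the GPT-4 report (percent accuracy).
SCORES = {
    "MMLU": {"gpt35": 70.0, "gpt4": 86.4},
    "WinoGrande": {"gpt35": 81.6, "gpt4": 87.5},
}

def improvement(old: float, new: float) -> tuple[float, float]:
    """Return (gain in percentage points, percent of the old errors removed)."""
    gain = new - old
    old_error = 100.0 - old
    new_error = 100.0 - new
    error_cut = 100.0 * (old_error - new_error) / old_error
    return round(gain, 1), round(error_cut, 1)

for name, s in SCORES.items():
    gain, error_cut = improvement(s["gpt35"], s["gpt4"])
    print(f"{name}: +{gain} points, {error_cut}% fewer wrong answers")
```

On MMLU that is a gain of 16.4 points, with roughly 55% of the earlier errors gone; on WinoGrande it is 5.9 points and roughly 32%. Both numbers compare GPT-4 with GPT-3.5 only, so they still tell us nothing about how either model compares to a human.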

Critical thinking, please

Now, performing well in tests is one thing; the exclamations on Twitter that “GPT-4 is coming for us” are quite another. What’s needed today, more than louder news headlines, is a bit of critical thinking, which is the skill of making sense of information in a systematic way, filtering out the noise, and understanding what interests are causing that noise in the first place.

We have to understand that OpenAI has skin in the game; it’s not a neutral observer. The more buzz there is around GPT-4, the better for OpenAI and its powerful ally Microsoft. Then, there are the newspapers, news agencies, etc., all of which also want to raise the buzz level as much as possible.

Then, I ask, who has an economic incentive to promote critical thinking? Nobody, I’m afraid.

No wonder we find so many outlandish claims about GPT-4 and AI these days.

Final thoughts

For this post, I did my homework. I read the original reports, not just the news. I even downloaded and examined some of the datasets used for evaluating GPT-4. I think this puts me in a better position to deliver to you, the reader, a more substantiated point of view.

I want to cut through the noise and deliver a (more or less) balanced view of what the significance of GPT-4 really is, especially when compared to humans. And every single Large Language Model (including GPT-2, 3, 3.5, and 4, as well as LLaMA, Chinchilla, Claude, Poe, and You (the YouBot, not you, the reader)) boils down to being a word completion predictor, that is, guessing what the next word is. That’s why they have been called “autocompletion on steroids,” which I think is pretty accurate.
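To make “guessing what the next word is” concrete, here is a toy sketch that counts which word follows which in a tiny made-up corpus and always predicts the most frequent follower. It is a deliberately crude stand-in: real Large Language Models use neural networks over sub-word tokens and vastly more text, and the corpus and function names below are invented for this illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, invented for this illustration only.
CORPUS = "the cat sat on the mat . the cat sat on the sofa . the dog ate the fish ."

def train_bigrams(text: str) -> dict[str, Counter]:
    """Count, for every word, which words follow it and how often."""
    followers: dict[str, Counter] = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(followers: dict[str, Counter], word: str):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]

def autocomplete(followers: dict[str, Counter], start: str, length: int = 4) -> str:
    """Extend `start` by repeatedly appending the predicted next word."""
    out = [start]
    for _ in range(length):
        nxt = predict_next(followers, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)

model = train_bigrams(CORPUS)
print(autocomplete(model, "the"))  # prints "the cat sat on the"
```

The toy model produces fluent-looking fragments purely from word statistics, with no notion of cats or mats. The gap between this and GPT-4 is enormous in scale and sophistication, but the task being solved, predicting what comes next, is the same kind of task.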

From the “autocompletion on steroids” characterization, it’s easy to see that Large Language Models don’t “understand” the world in the way we do, that is, by building a kind of abstract model of how things work. GPT-4 can perform better than GPT-3 but shares with it the same lack of real understanding.

But as the impressive results of GPT-4 have shown, many problems don’t need real understanding to be solved, so I’m not at all dismissing generative AI as a valuable tool for a myriad of tasks.

So please, don’t fall into the “human-level performance” trap. It’s a misplaced, technical-sounding, but in the end, useless way of expressing what advanced AI is capable of.