Jacob Ferus

AI

GPT-3.5 Could Not Solve These Questions, Can GPT-4?

Image generated by Jacob Ferus

GPT-4, which many have been eagerly waiting for, has finally been released, and it brings a set of new and improved features. One of the major advancements is the ability to process image inputs, which opens up a whole new dimension of possibilities. Additionally, the context length has been extended, which means GPT-4 can process and output longer and more complex text.

In addition to its new functionality, GPT-4 has improved creativity and reasoning capabilities, leading to better scores on exams and tests. To learn more about its features, check out the article below:

However, the true extent of its capabilities is difficult to judge from numbers alone. We need to try it out ourselves to get a feel for what it can and can’t do.

In a previous article, I tested ChatGPT, or more precisely GPT-3.5, by creating a few custom tests that it had likely never seen before, in order to understand its limitations. The conclusion was rather clear: GPT-3.5 showed a lack of human-level reasoning, answering most tests incorrectly even though I believe a human would solve them fairly easily.

Today, it’s GPT-4’s turn to give the problems a go. Is there a clear difference from GPT-3.5? Has it come closer to human-level reasoning? Let’s find out.

Setup

In this article, I use GPT-4 through ChatGPT Plus:

The Puzzle

This puzzle is quite simple for a human, yet GPT-3.5 struggled immensely without further instructions:

First, a number-to-letter mapping is shown. Then a series of numbers is displayed, implying that the model should translate the numbers to letters using the mapping.

Let’s see if GPT-4 can complete it:

Correct! This was on the first attempt. I tried running it a few more times to see if it answered differently. On subsequent attempts, it started to think that the underscore should be replaced with a letter:

The intention was that the underscore should not be replaced, but the instructions may not have been clear enough. By replacing “_” with a space, it answered correctly every time.
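The intended behavior can be sketched in a few lines of Python. This is only an illustration: the mapping and the number sequence below are made up for the example (the actual puzzle appears only in the screenshot), and the function name `decode` is my own. The point is that anything not found in the mapping, such as the underscore, is passed through unchanged.

```python
# Hypothetical mapping and input, NOT the puzzle from the article.
MAPPING = {1: "a", 2: "b", 3: "c", 4: "d"}


def decode(tokens, mapping):
    """Translate numeric tokens to letters; leave everything else as-is."""
    out = []
    for token in tokens:
        if token.isdigit() and int(token) in mapping:
            out.append(mapping[int(token)])
        else:
            # Tokens such as "_" are not in the mapping and are kept unchanged.
            out.append(token)
    return "".join(out)


decoded = decode("2 1 4 _ 3 1 2".split(), MAPPING)
print(decoded)  # bad_cab
```

Under this reading, the underscore is simply a separator, which is the interpretation the puzzle intended.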

The Reading Test

In the next experiment, the test is to draw a logical conclusion from a conversation:

In the first statement, we are told that one of the people James is talking to is his father. Then a series of statements clearly indicate Josh is the father:

  • Josh says “Good job son”
  • Josh says “tell your mother” and “we should eat them”
  • Douglas says “I wish my family would eat fish tonight, my father is making pancakes”, implying that he is not part of the family that will eat fish.

Rather unexpectedly, GPT-3.5 repeatedly insisted Douglas was the father, showing a clear lack of understanding.

Let’s try with GPT-4:

When I repeated the question several times, GPT-4 gave the same correct answer each time. It has evidently improved here.
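Repeated runs like these can be tallied programmatically. The sketch below is an assumption-laden illustration rather than how the tests in this article were run: `ask` stands for any function that sends a prompt to a model and returns its answer, and `fake_model` with its canned answers is a stand-in so the example runs without a real model. The canned answers are invented and are not actual GPT-3.5 or GPT-4 output.

```python
from collections import Counter


def tally_answers(ask, prompt, runs):
    """Call ask(prompt) `runs` times and count each normalized answer."""
    return Counter(ask(prompt).strip().lower() for _ in range(runs))


def success_rate(counts, expected):
    """Fraction of runs whose normalized answer equals `expected`."""
    total = sum(counts.values())
    return counts[expected.strip().lower()] / total if total else 0.0


# Stand-in for a real model call; the answers are made up for illustration.
canned = iter(["Josh", "josh ", "Douglas", "Josh", "Josh"])


def fake_model(prompt):
    return next(canned)


# The prompt text is hypothetical; the real prompt is shown in the screenshot.
counts = tally_answers(fake_model, "Who is James's father?", runs=5)
rate = success_rate(counts, "Josh")
print(counts["josh"], counts["douglas"], rate)  # 4 1 0.8
```

Normalizing case and whitespace before counting keeps trivially different spellings of the same answer from being counted as disagreements.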

The Trick Question

Finally, a trick question is asked. I will start with the easier, unambiguous version of the question. GPT-3.5 answered this version correctly about 20% of the time.

The trick is that once the original coins have disappeared (after one day), there are no coins left to transform. Hence, the same number of apples will exist after three days as after one day.
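A tiny simulation makes the reasoning concrete. The numbers and the exact rule here are assumptions for illustration, since the real question is only shown in the screenshot: I assume that once per day every existing coin is replaced by a fixed number of apples, and that apples never transform into anything.

```python
def simulate(coins, apples_per_coin, days):
    """Return (coins, apples) after `days` days under the assumed rule.

    Assumed rule: each day, every existing coin is replaced by
    `apples_per_coin` apples. Apples never transform.
    """
    apples = 0
    for _ in range(days):
        apples += coins * apples_per_coin
        coins = 0  # every coin was consumed by the transformation
    return coins, apples


# Hypothetical starting values, not the ones from the article's question.
after_one_day = simulate(coins=3, apples_per_coin=2, days=1)
after_three_days = simulate(coins=3, apples_per_coin=2, days=3)
print(after_one_day, after_three_days)  # (0, 6) (0, 6)
```

From day two onward the loop adds zero apples because no coins remain, so the count after three days matches the count after one day.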

GPT-4’s answer:

Not only does it answer correctly, but it also clearly explains how it arrived at this conclusion. When I tested it a few more times, it reached the same conclusion. Let’s try the more difficult and less explicit question that GPT-3.5 could not get right even a single time:

Here I do not give an example of what will happen after a day, as I did in the first question, and the fact that only coins can be transformed is stated only implicitly. The answer should nevertheless be the same.

GPT-4:

Across several repetitions of the prompt, GPT-4 gives the same result: the correct answer.

Summary

GPT-4 is clearly on a different level from its predecessors. It answered all questions correctly, apart from the ambiguity of the underscore in the first question, which was resolved by replacing it with a space. It seems like it’s time to introduce a new, more difficult set of questions to test GPT-4 and future models as their capabilities steadily increase.

If you enjoyed this article:

  • 👏 Clap, this will help me understand what my readers like and want more of
  • 🙏 Follow or subscribe, if you would like to read my upcoming articles, new ones every week!
  • 📚 If you are looking for more content, check out my reading lists in AI, Python or Data Science

Thanks for reading and have a great day.
