Peter Xing



Microsoft’s Medprompt+ with GPT-4 Beats Gemini Ultra, Reclaims Benchmark Throne on MMLU

Well, that was quick! The purportedly superior benchmark performance of Google DeepMind’s Gemini Ultra over OpenAI’s GPT-4 has been short-lived.

Microsoft Research just released a blog post about its Medprompt+ approach on GPT-4, which retakes the benchmark throne from Gemini Ultra, announced only a week earlier.

Medprompt and its Evolution: Developed by a team at Microsoft Research, Medprompt represents a significant leap in prompting strategies. It uses specialized prompting techniques to draw expert-level responses out of generalist AI models. The approach has since been extended into a more robust version, Medprompt+, which integrates simple and complex prompting methods. This combination has been instrumental in achieving state-of-the-art (SoTA) results on various benchmarks.

Performance on MMLU Benchmark: The Massive Multitask Language Understanding (MMLU) benchmark is a comprehensive test of the general knowledge and reasoning abilities of large language models. The Medprompt approach performs strongly on this benchmark, with the modified version, Medprompt+, achieving a record score of 90.10% and surpassing Google’s Gemini Ultra.

Promptbase — A Resource Hub: Promptbase, a repository on GitHub, has been introduced to disseminate information and tools for maximizing the performance of foundation models. It includes scripts for replicating results using the Medprompt methodologies and will continue to expand with more resources in the future.

Techniques Behind Medprompt: Medprompt combines several strategies:

  1. Dynamic Few Shots: This involves selecting task-specific few-shot examples dynamically, enhancing relevance and adaptability.
  2. Self-Generated Chain of Thought (CoT): It encourages the model to generate intermediate reasoning steps, thereby improving complex reasoning capabilities.
  3. Majority Vote Ensembling: This technique combines multiple outputs to yield better predictive performance, enhanced by choice-shuffling for robustness.
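
The first and third techniques can be illustrated with a minimal sketch. It is not the Promptbase implementation: the function names, the toy embedding vectors, and the `ask_model` callback are all hypothetical, and cosine similarity is assumed here as the relevance measure for picking few-shot examples. In a real pipeline, `ask_model` would prompt the model with the selected examples and a chain-of-thought instruction, then parse the chosen option.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_few_shots(question_vec, pool, k=2):
    """Dynamic few-shot selection: keep the k examples closest to the question.

    Each pool entry is a dict with a precomputed embedding under "vec"
    (an assumed layout, chosen only for this illustration).
    """
    ranked = sorted(pool, key=lambda ex: cosine(question_vec, ex["vec"]), reverse=True)
    return ranked[:k]


def choice_shuffle_vote(question, choices, ask_model, runs=5, seed=0):
    """Majority vote over several runs, shuffling the answer order each time.

    ask_model(question, shuffled_choices) is a hypothetical callback that
    returns the index of the option it picked within shuffled_choices.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(runs):
        order = list(range(len(choices)))
        rng.shuffle(order)
        shuffled = [choices[i] for i in order]
        picked = ask_model(question, shuffled)
        votes[order[picked]] += 1  # map the pick back to the original index
    winner, _ = votes.most_common(1)[0]
    return choices[winner]
```

Shuffling matters because a model that favors a particular position, say always the first option, spreads its votes across different answers once the order changes, while a model that actually recognizes the right answer keeps voting for it.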

Extending Medprompt: Medprompt+ extends the original framework by incorporating simple, direct prompts alongside the sophisticated CoT-based ones. This approach dynamically selects the most appropriate technique for each problem, leading to improved performance across diverse MMLU challenges.
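
To make the idea of per-problem selection concrete, here is a small sketch. The article does not spell out Medprompt+’s actual control policy, so the rule below (take the answer from whichever strategy reports the higher confidence) is purely an assumption for illustration, as are the `Candidate` type, the `pick_answer` function, and the convention that each strategy returns an `(answer, confidence)` pair.

```python
from typing import Callable, Dict, NamedTuple, Tuple


class Candidate(NamedTuple):
    answer: str
    confidence: float  # assumed to lie in [0, 1]
    strategy: str


def pick_answer(
    question: str,
    strategies: Dict[str, Callable[[str], Tuple[str, float]]],
) -> Candidate:
    """Run every strategy and keep the most confident candidate.

    This is a hypothetical selection rule, not Microsoft's published policy.
    On a tie, the strategy listed first wins.
    """
    candidates = [Candidate(*fn(question), name) for name, fn in strategies.items()]
    return max(candidates, key=lambda c: c.confidence)
```

A call might pass `{"simple": simple_prompt, "cot_ensemble": medprompt}` where both values are placeholder functions wrapping the two prompting styles.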

Future Directions: The development of Medprompt and Medprompt+ marks a significant milestone in the realm of AI prompting strategies. These methodologies not only demonstrate the capabilities of generalist models like GPT-4 in specialist domains but also pave the way for more nuanced and efficient prompting strategies. The field is rapidly evolving, and with platforms like Promptbase, the AI community can expect continuous advancements and collaborative opportunities in prompt engineering.

Gemini Ultra Controversies

Google DeepMind’s latest AI model, Gemini, has sparked intense debate on social media over how it compares with OpenAI’s GPT-4. Influencers are particularly focused on the evaluation methods and practical applications of these models:

  • Smitarani Tripathy, Social Media Analyst, GlobalData: Highlights the debate over Gemini AI’s evaluations, noting that influencers find Gemini Ultra underperforming compared to GPT-4 in standard evaluations. The CoT@32 method is seen as impractical in real-world applications, which commenters take as evidence of GPT-4’s superiority and of the need for more transparent evaluations.
  • Saurabh Kumar, Co-Founder, Adora: Critiques Gemini’s use of uncertainty-routed CoT to claim higher MMLU scores, pointing out GPT-4’s superior performance in standard evaluations and questioning the lack of explanation for the technique’s benefits.
  • Harry Surden, Professor of Law, University of Colorado: Expresses disappointment in Gemini Ultra’s need for CoT@32 to surpass GPT-4, expecting Gemini to perform better in standard 5-shot evaluations.
  • Shital Shah, Principal Research Engineer, Microsoft: Observes that while Gemini beats GPT-4 with CoT@32, it falls short in 5-shot evaluations, suggesting Gemini’s inherent power is not fully realized without proper prompting.
  • Bindu Reddy, CEO, Abacus.AI: Points out that Gemini’s lead over GPT-4 in MMLU is specific to CoT@32 and that GPT-4 maintains a lead in standard 5-shot evaluations.
  • Ethan Mollick, Professor, The Wharton School: Raises questions about Gemini Ultra’s capabilities and its narrow margin over GPT-4, pondering the implications for the future of large language models (LLMs).
  • Brett Winton, Investment Advisor, ARK Invest: Criticizes the comparisons between prompt-engineered Gemini and non-engineered GPT-4, calling for like-for-like evaluations.
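
For readers unfamiliar with the setup these comments refer to, uncertainty-routed CoT@32 samples 32 chain-of-thought answers, takes the majority answer when the samples agree strongly enough, and otherwise falls back to a single greedy answer. The sketch below shows only that routing step. The function name and the 0.6 threshold are assumptions made for illustration, not values taken from Google’s evaluation.

```python
from collections import Counter


def uncertainty_routed_cot(sampled_answers, greedy_answer, threshold=0.6):
    """Return the majority answer if consensus is high, else the greedy answer.

    sampled_answers: final answers parsed from k sampled chains of thought.
    greedy_answer:   the answer from one greedy (non-sampled) decode.
    threshold:       minimum share of samples that must agree; the default
                     here is an arbitrary placeholder.
    """
    if not sampled_answers:
        return greedy_answer
    answer, count = Counter(sampled_answers).most_common(1)[0]
    consensus = count / len(sampled_answers)
    return answer if consensus >= threshold else greedy_answer
```

The cost is the point of contention: every question needs 32 sampled generations plus one greedy one, which is why several of the commenters above see the method as hard to justify outside a benchmark setting.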

The release of Google’s Gemini AI has led to significant discussions among AI influencers, focusing on its comparative performance with GPT-4 and the methodologies used in evaluations. There is a general consensus on the need for direct, comparable evaluations to gauge the true capabilities of these models, with many influencers leaning towards GPT-4’s superiority in standard settings. The discussions highlight the evolving nature of AI technology and the complexities involved in its assessment.

Medprompt
Gpt4
Gemini
AI
Artificial Intelligence