Generative AI: A Fresh Survey

The past, the present and the future of Language Models

OpenAI 3D Logo. Image by thefactsite.com

In the dynamic landscape of Generative Artificial Intelligence (GenAI), keeping pace with the latest developments can be a daunting task. But don’t worry: I recently found a great paper on arXiv that explores recent trends and future directions, and I’ll break it down in this story.

Outline:
1. The History of Generative AI: Let’s Start from the Beginning
2. Current Challenges: Fine-Tuning, Hallucinations and Alignment
3. New Trends: Mixture of Experts (MoE) and Multimodal Models
4. Real-world Applications of Generative AI and Ethical Considerations
5. The Future of AI: OpenAI’s Q* Project

1. The History of Generative AI: Let’s Start from the Beginning

The rise of Generative AI has been marked by significant milestones, with each new model paving the way for the next evolutionary leap. Language models have undergone a transformative journey, evolving from rudimentary statistical methods to the complex neural network architectures that underpin today’s Large Language Models (LLMs).

Figure 1: Timeline of Large Language Models — Design by Armin Norouzi

The inception of language modeling (Fig. 1) can be traced back to the statistical approaches of the late 1980s, a period marked by a transition from rule-based to machine learning algorithms in Natural Language Processing (NLP). Early models, primarily n-gram based, calculated the probability of word sequences in a corpus, thus providing a rudimentary understanding of language structure. These models, though simplistic, laid the groundwork for future advances in language understanding.
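To make the n-gram idea concrete, here is a minimal sketch of a bigram model (n = 2) that estimates the probability of a word given the previous one by simple counting. The toy corpus and the function names `train_bigram` and `bigram_prob` are my own illustrative assumptions and do not come from the paper.

```python
from collections import Counter

def train_bigram(tokens):
    """Count how often each context word and each word pair occurs."""
    contexts = Counter(tokens[:-1])             # every word that has a successor
    pairs = Counter(zip(tokens, tokens[1:]))    # adjacent word pairs
    return contexts, pairs

def bigram_prob(contexts, pairs, prev, word):
    """Maximum-likelihood estimate of P(word | prev) = count(prev, word) / count(prev)."""
    if contexts[prev] == 0:
        return 0.0                              # unseen context: no estimate available
    return pairs[(prev, word)] / contexts[prev]

# Toy corpus (an assumption, for illustration only)
corpus = "the cat sat on the mat the cat ran".split()
contexts, pairs = train_bigram(corpus)
p_cat = bigram_prob(contexts, pairs, "the", "cat")   # "the" is followed by "cat" in 2 of 3 cases
p_mat = bigram_prob(contexts, pairs, "the", "mat")
```

Unseen word pairs get probability zero here, which is one reason practical n-gram models add some form of smoothing.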

The rise in computational power in the late 1980s sparked a revolution in NLP, shifting the focus towards statistical models capable of making ‘soft’ probabilistic decisions, as opposed to the rigid, ‘handwritten’ rule-based systems that dominated early NLP systems.

In the following decade, the popularity and applicability of these statistical models skyrocketed, proving invaluable in managing the flourishing flow of digital text. The 1990s saw the firm establishment of statistical methods in NLP research, with n-grams playing a crucial role in numerically capturing linguistic patterns.

A significant milestone was reached in 1997 with the introduction of Long Short-Term Memory (LSTM) networks and their application to voice and text processing, leading to the current era where neural network models represent the cutting edge of NLP research and development.

The emergence of deep learning has revolutionized the field, leading to the creation of language models such as GPT, BERT, RoBERTa, BART and DeBERTa and later, notably, to LLMs such as OpenAI’s ChatGPT (November 2022). Recent models like GPT-4, LLaMA, Google Bard and Anthropic Claude have further pushed the boundaries of AI by showcasing unprecedented levels of language understanding and generation.

2. Current Challenges: Fine-Tuning, Hallucinations and Alignment

The rapid proliferation of LLMs, and their extensive utilization in the last few months, has emphasized the significance of fine-tuning, hallucination reduction, and alignment. These aspects play a crucial role in enhancing the functionality and reliability of LLMs.

Fine-tuning, i.e. the process of adapting pre-trained models to specific tasks, has made notable strides. Techniques such as prompt-based and few-shot learning, coupled with supervised fine-tuning on specialized datasets, have enhanced the adaptability of LLMs across various contexts. Despite this progress, challenges persist, particularly in addressing biases and ensuring the generalization of models across diverse tasks.

Reducing hallucinations, i.e. the generation of confidently asserted yet factually incorrect information (Fig. 2), also remains a persistent challenge for LLMs. This issue has been partially mitigated by the introduction of Retrieval-Augmented Generation (RAG) models, i.e. models capable of retrieving relevant information before the actual text generation step.
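The retrieve-then-generate idea can be sketched in a few lines. The example below is a deliberately simplified assumption: it ranks documents by plain word overlap and only builds the prompt that would be handed to a model, whereas real RAG systems typically rely on learned embeddings and a vector index. The names `retrieve` and `build_prompt` and the tiny document store are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Put the retrieved context in front of the question before generation."""
    context = "\n".join(retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical document store built from facts mentioned in this story
docs = [
    "LSTM networks were introduced in 1997.",
    "Mixtral-8x7B is a Mixture of Experts model from MistralAI.",
    "ChatGPT was released by OpenAI in November 2022.",
]
prompt = build_prompt("When was ChatGPT released?", docs)
```

Because the model is asked to answer from the retrieved passage rather than from memory alone, it has less room to invent facts.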

Figure 2: An example of model hallucination. Picture by Karen Weise and Cade Metz (The New York Times)

If you want to know more about AI hallucinations, you can check out the following article for a more detailed overview.

How to Mitigate Hallucinations in Large Language Models (LLMs): A 2024 survey on state-of-the-art techniques for mitigating hallucination in LLMs (generativeai.pub)

Finally, concerning alignment, innovative approaches have been proposed to ensure that LLM outputs align with human values and ethics. Solutions range from constrained optimization to reward modeling techniques, all aiming to embed human preferences within AI systems, either during training or fine-tuning.
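As an illustration of reward modeling, one common formulation (an assumption on my part, since the story does not commit to a specific method) trains a reward model on pairs of answers ranked by humans, using the pairwise loss -log(sigmoid(r_chosen - r_rejected)). The sketch below computes only that loss for two scalar rewards; the function name is hypothetical.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Loss -log(sigmoid(r_chosen - r_rejected)) for one human-ranked pair.

    The loss is small when the preferred answer scores higher than the
    rejected one, and large when the ranking is reversed.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)), written in a numerically stable way
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

loss_agree = pairwise_preference_loss(2.0, 0.0)      # model agrees with the human ranking
loss_disagree = pairwise_preference_loss(0.0, 2.0)   # model disagrees with it
loss_tie = pairwise_preference_loss(1.0, 1.0)        # equal scores give log(2)
```

Minimizing this loss over many ranked pairs pushes the reward model to score human-preferred answers higher, and that reward can then guide fine-tuning.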

However, the complexity of aligning AI with the diverse spectrum of human ethics and the persistence of hallucinations, particularly on culturally-sensitive topics, highlight the need for continued interdisciplinary research in the development and application of LLMs.

3. New Trends: Mixture of Experts (MoE) and Multimodal Models

⭐ Mixture of Experts. The recently adopted Mixture of Experts (MoE) architecture is a big deal in the AI/LLM world (Fig. 3). Showcased by models like Google’s Switch Transformer and MistralAI’s Mixtral-8x7B, this method uses a set of transformer-based expert modules and dynamically routes each token to only a few of them, making modeling more efficient and scalable.

Figure 3: The general MoE architecture. Image by Jongwon Yoon

One of the major benefits of MoE is that it scales to huge parameter counts while keeping memory use and computational costs in check, since only a subset of the experts is active for each token. This is done through model parallelism across specialized experts, which enables the training of models with trillions of parameters. Its specialization in dealing with diverse data distributions boosts its proficiency in tasks like few-shot learning.
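The routing step can be sketched as follows. This is a minimal top-k gating example in plain Python under simplifying assumptions: the router logits are given rather than learned, and each "expert" is a hypothetical function that just scales its input. Mixtral-8x7B, for reference, routes each token to 2 of its 8 experts.

```python
import math

def softmax(xs):
    """Turn router logits into probabilities (shifted by the max for stability)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Pick the k most probable experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    output = [0.0] * len(token)
    for idx, weight in route_token(router_logits, k):
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Hypothetical experts: expert i multiplies the token vector by a fixed scale
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]        # assumed router scores for one token
routing = route_token(logits, k=2)    # experts 1 and 3 tie for the top spots
out = moe_layer([1.0, 1.0], logits, experts, k=2)
```

Only two of the four experts do any work for this token, which is where the compute savings come from: total parameters grow with the number of experts, but the cost per token grows only with k.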

Now, let’s consider its potential in healthcare. An MoE-based system could be used for personalized medicine, where different ‘expert’ modules specialize in various aspects of patient data analysis, including genomics, medical imaging, and electronic health records. This could significantly improve diagnostic accuracy and treatment personalization. Similarly, an MoE-based system could be used to create personalized gaming experiences, with distinct ‘experts’ focusing on player performance, play style, and in-game choices, respectively. Finally, in the field of marketing, MoE models could be used for consumer behavior analysis, with experts looking at different consumer indicators, market trends, and regulatory compliance factors.

However, to fully unlock the potential of MoE, issues such as expert imbalance, dynamic routing complexity and probability dilution have to be addressed.

⭐ Multimodal Models. Along the same lines, the rise of multimodal AI is changing the way machines understand and interact with all sorts of human sensory inputs and contextual data (Fig. 4). These models facilitate accurate and data-efficient analysis by employing multi-view pipelines and cross-attention blocks. This integration of diverse inputs allows for a more nuanced and detailed interpretation of data, enhancing the model’s ability to accurately analyze and understand various types of information. Among these models, Google Gemini stands out as the latest multimodal conversational system, able to process not only text, documents, images, and code, but also audio and video.

Figure 4: Graphical comparison of unimodal and multimodal models. Image by Shehmir Javaid

However, the development of multimodal AI systems faces several technical hurdles, including creating robust and diverse datasets, managing scalability, and enhancing user trust and system interpretability. Challenges like data skew and bias are prevalent due to data acquisition and annotation issues, and they call for effective dataset management through strategies such as data augmentation, active learning, and transfer learning. Another significant challenge is the computational demand of processing various data streams simultaneously, which requires powerful hardware and optimized model architectures for multiple encoders.

4. Real-world Applications of Generative AI and Ethical Considerations

The use of generative AI models in real-world situations is showing us both the amazing possibilities and the tough challenges in different sectors.

Healthcare: In this sector, GenAI is making big strides in areas like diagnostic imaging and personalized medicine. For instance, it’s helping doctors spot diseases earlier and tailor treatments to individual patients. But it’s not all good news. There are serious worries about data privacy and the potential misuse of sensitive health information. We need to make sure that as we push forward with AI in healthcare, we’re also protecting patients’ personal information.
Finance: AI is proving to be a powerful tool in finance as well, especially when it comes to spotting fraud and making algorithmic trades. It’s fast, it’s accurate, and it’s efficient. But there are ethical issues we need to take into account. Automated decision-making processes can lack transparency and accountability, which raises questions about fairness and oversight.
Education: LLMs are opening up new possibilities in education, like creating personalized learning experiences. This could make education more accessible and instruction more tailored to individual students. But, again, there are hurdles to overcome. Not everyone has equal access to technology, and there’s the risk of biases in the AI-generated content. Additionally, if AI takes over some teaching tasks, what does that mean for human teachers?
Creative AI: This is a rising field that is pushing AI’s creative limits across different forms like images, audio, and video. It’s all about generating artistic content, from telling stories and writing poetry, news, or posts to composing music and creating visual art. It’s even led to commercial hits like Midjourney and DALL-E. But it’s not without its challenges. We need to figure out the best ways to represent data, the right algorithms to use, and how to measure creativity effectively. Copyright has also become a significant concern: as AI starts to create content that could be very similar to human-created content, it raises questions about who owns the rights to that content. It’s a complex issue that’s still being worked out, and it’s something that anyone working with Creative AI needs to be aware of.

An image generated with Midjourney v6. Image by mid-journey.ai

5. The Future of AI: OpenAI’s Q* Project

First of all: “What is Q*?” The Q* project is another major OpenAI initiative aimed at advancing AI technology. While OpenAI hasn’t published specific details about Q*, the project is reportedly focused on developing an ethical, general-purpose AI system that is beneficial to society. Furthermore, the goal of Q* is to demonstrate proficiency across a broad spectrum of challenges, including mathematical reasoning, which is particularly challenging for today’s LLMs.

But “How do they plan to achieve this?” Rumor has it that the Q* project is all about mixing Reinforcement Learning (RL) and AI search algorithms with the creativity of LLMs. While Gemini has made big strides in multimodal AI, combining different types of data inputs like text, images, audio, and video, Q* is expected to take us far beyond what we’ve achieved so far by bringing together creative reasoning and structured problem-solving. This could be achieved by combining the precision and efficiency of search algorithms like A* with the adaptable Q-learning strategy and the complex understanding of human language and context that LLMs offer.
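Since nothing official is known about Q*, the sketch below does not describe that project. It only illustrates the classical tabular Q-learning update named above, Q(s, a) ← Q(s, a) + α · (r + γ · max Q(s', a') − Q(s, a)), on a made-up transition. The state names, action names, and hyperparameter values are all assumptions.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update and return the new value of Q(state, action)."""
    best_next = max(q[(next_state, a)] for a in actions)   # greedy estimate of future value
    target = reward + gamma * best_next                    # bootstrapped return
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)            # every Q-value starts at 0.0
actions = ["left", "right"]       # hypothetical action set

# The same rewarding transition observed twice: the estimate moves toward the reward
first = q_update(q, "s0", "right", 1.0, "s1", actions)
second = q_update(q, "s0", "right", 1.0, "s1", actions)
```

Each update nudges the value estimate toward the observed reward plus the discounted value of the best next action, which is the "adaptable" learning-from-experience part of the rumored mix.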

This kind of integration could allow AI systems not just to process and analyze complex multimodal data, but also to navigate through structured tasks while coming up with creative solutions and generating knowledge. This mirrors the multifaceted nature of human thinking, and the potential implications of this advancement would be huge.