Mixture of Memory Experts: Lamini Memory Tuning
Introduction
I just came across a blog post by Lamini called “Introducing Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations”. They propose the idea of the Mixture of Memory Experts (MoME), which could be another game-changing technique, after mixture of experts and agentic AI, on the path toward better AI solutions. The main goal of MoME is to reduce hallucination: they claim it can cut the hallucination rate from 50% to 5%, which is huge. Hallucination refers to the problem of LLMs generating information that is not grounded in their training data.
Other approaches in use today, such as prompting and RAG combined with prompting, treat hallucinations as the result of a trade-off between creativity and factuality, one that can be mitigated by connecting LLMs to external knowledge sources. However, this study challenges that view and presents evidence that these traditional approaches cannot explain why LLMs hallucinate in practice.
Proposed Solution
The paper proposes a new approach called Lamini Memory Tuning and introduces a first-generation model, Lamini-1, which relies on a massive MoME and targets near-zero training loss for key facts that should not be hallucinated. This architecture is designed to store and dynamically retrieve facts using millions of memory experts.

In this work we build on information retrieval, database systems, and LLM training systems to propose Lamini-1, a model architecture eschewing transformers for knowledge retrieval and instead relying entirely on a massive mixture of memory experts (MoME). Previous work has shown how it’s possible to inject memories directly into LLMs (Meng et al., 2022). Lamini-1 allows for significantly more parallelization and can reach a new state-of-the-art in factual recall after 1 hour of training on 8 MI300X GPUs.
The authors argue that a single epoch of training is appropriate for tasks requiring generalization and creativity, where some random choice among similar tokens is acceptable. However, it is not sufficient to reach the level of precision needed for factual tasks, where getting the answer exactly right matters.
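To make that precision argument concrete, here is a small, purely illustrative calculation (not from the paper) connecting the probability a model assigns to the correct answer token with its cross-entropy loss; driving the training loss toward zero is the same as putting essentially all of the probability mass on the exact fact.

```python
import math

# Cross-entropy loss for a single target token is -log(p), where p is the
# probability the model assigns to the correct token. The probabilities
# below are made-up values, chosen only to illustrate the scale.
for p in [0.5, 0.9, 0.99, 0.999]:
    print(f"p(correct token) = {p:<6} -> per-token loss = {-math.log(p):.4f}")

# A loss around 0.69 (p = 0.5) still leaves a coin flip between two plausible
# tokens -- fine for creative text, but the model gets a precise fact wrong
# about half the time. Near-zero loss (p ~ 0.999) is what exact recall needs.
```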
In the paper, they mention that training large language models is extremely computationally intensive. For instance, training Llama 2 with 70 billion parameters required 35 days using 2,000 A100 GPUs for just a single epoch. Achieving zero training loss on key facts would require 100 epochs, increasing the computational demand by 100 times.
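A quick back-of-the-envelope calculation shows why that 100x factor is prohibitive; the numbers are the ones quoted above, the arithmetic is mine:

```python
# Rough estimate based on the quoted figures: 35 days on 2,000 A100 GPUs
# for one epoch of Llama 2 70B, and ~100 epochs to reach near-zero loss
# on key facts.
gpu_count = 2_000
days_per_epoch = 35
epochs_for_memorization = 100

gpu_days_one_epoch = gpu_count * days_per_epoch            # 70,000 GPU-days
gpu_days_memorized = gpu_days_one_epoch * epochs_for_memorization

print(f"One epoch:          {gpu_days_one_epoch:,} GPU-days")
print(f"100 epochs (naive): {gpu_days_memorized:,} GPU-days")
print(f"Wall-clock on the same cluster: ~{days_per_epoch * epochs_for_memorization / 365:.1f} years")
```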
To address these high costs, Lamini-1 combines a transformer backbone (such as Llama 2) with a massive bank of memory experts (the MoME). The transformer backbone is frozen, and the memory experts are then trained on a dataset to memorize the facts.
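The paper does not come with reference code, so the following PyTorch snippet is only my sketch of how such a layer could look: a frozen backbone produces hidden states, a cross-attention-style lookup selects a small subset of experts from a large bank, and the selected experts contribute a memory vector. The module names, sizes, and top-k routing details are assumptions made for illustration, not Lamini-1 internals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoMELayer(nn.Module):
    """Illustrative mixture-of-memory-experts layer (my interpretation, not
    the official Lamini-1 implementation)."""

    def __init__(self, d_model: int, num_experts: int = 10_000, top_k: int = 32):
        super().__init__()
        self.top_k = top_k
        # One key and one value vector per memory expert; the paper describes
        # a bank on the order of a million experts.
        self.expert_keys = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        self.expert_values = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        self.query_proj = nn.Linear(d_model, d_model)  # cross-attention query

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) coming from the frozen backbone
        query = self.query_proj(hidden)
        scores = query @ self.expert_keys.T                      # (batch, seq, num_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)    # pick k experts per token
        weights = F.softmax(top_scores, dim=-1)                  # (batch, seq, k)
        selected = self.expert_values[top_idx]                   # (batch, seq, k, d_model)
        memory = (weights.unsqueeze(-1) * selected).sum(dim=-2)  # weighted mixture of experts
        return hidden + memory                                   # residual connection
```

In a full model this layer would sit on top of (or be interleaved with) the frozen backbone's blocks; it is kept standalone here to keep the sketch short.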

The massive MoME is designed to cut down on the amount of computation required to memorize facts. This is accomplished by the following training algorithm:
1. For a given question, select a subset of experts, e.g. 32 out of the array of one million.
2. Freeze the weights of the backbone network and the cross-attention used to select the experts.
3. Take gradient descent steps until the loss is reduced sufficiently to memorize the fact.
The computation cost of memorizing each fact now scales with the number of training examples, not with the total number of parameters in the network.
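Again as a hedged sketch rather than the paper's actual code, one memorization step for a single fact could look like the function below, reusing the hypothetical MoMELayer-style model from above. The optimizer, learning rate, stopping threshold, and step limit are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def memorize_fact(model: nn.Module,
                  expert_bank: nn.Parameter,
                  input_ids: torch.Tensor,
                  target_ids: torch.Tensor,
                  loss_threshold: float = 1e-3,
                  max_steps: int = 500) -> float:
    """Drive the loss on one fact toward zero by updating only the memory
    experts (illustrative; hyperparameters are assumptions, not the paper's)."""
    # Step 2: freeze everything (backbone + expert-selection cross-attention),
    # then re-enable training only for the expert value bank.
    for p in model.parameters():
        p.requires_grad_(False)
    expert_bank.requires_grad_(True)

    optimizer = torch.optim.Adam([expert_bank], lr=1e-3)

    for _ in range(max_steps):
        logits = model(input_ids)                    # (batch, seq, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
        if loss.item() < loss_threshold:             # step 3: stop once memorized
            break
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Because the forward pass indexes only the top-k selected experts (step 1),
    # gradients touch just those rows of the expert bank -- which is why the
    # cost scales with the number of facts rather than the full parameter count.
    return loss.item()
```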
A plot in the paper shows a hypothetical case comparing the training loss of memory-tuned models against an underfit model, an overfit model, and a model trained to minimize generalization error.
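That figure is not reproduced here, so the snippet below only draws schematic curves of the kind described; the shapes are invented purely for illustration and are not results from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

# Schematic curves only -- invented shapes meant to convey the qualitative
# comparison described in the paper, not its actual results.
epochs = np.linspace(0, 100, 200)

underfit     = 1.7 + 0.3 * np.exp(-epochs / 30)          # plateaus well above zero
generalizer  = 2.0 * np.exp(-epochs / 10) + 0.6          # stops near the best generalization loss
overfit      = 2.0 * np.exp(-epochs / 10) + 0.05         # keeps pushing training loss down
memory_tuned = 2.0 * np.exp(-epochs / 5) + 0.001         # driven to ~0 on the key facts

plt.figure(figsize=(7, 4))
for curve, label in [(underfit, "underfit"),
                     (generalizer, "minimize generalization error"),
                     (overfit, "overfit"),
                     (memory_tuned, "memory-tuned (key facts)")]:
    plt.plot(epochs, curve, label=label)
plt.yscale("log")
plt.xlabel("training epochs")
plt.ylabel("training loss on key facts (schematic)")
plt.legend()
plt.tight_layout()
plt.show()
```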

A very interesting paper overall. Let's see what follow-up work builds on it.