Mixture of Memory Experts: Lamini Memory Tuning
Introduction
I just came across a blog post by Lamini called “Introducing Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations”. They propose the idea of the Mixture of Memory Experts (MoME), which could be another game-changing technique, after mixture of experts and agentic AI, on the path toward better AI solutions. The main goal of MoME is to reduce hallucination: they claim it can cut the hallucination rate from 50% to 5%, which is huge. Hallucination refers to the problem of LLMs generating information that is not grounded in their training data.
Other approaches in use today, such as prompting and RAG combined with prompting, treat hallucinations as the result of a trade-off between creativity and factuality, one that can be mitigated by connecting LLMs to external knowledge sources. However, this study challenges that view and presents evidence that these traditional approaches cannot explain why LLMs hallucinate in practice.
Proposed Solution
The paper proposes a new approach called Lamini Memory Tuning and introduces a first-generation model, Lamini-1, which relies on a massive MoME and targets near-zero training loss for key facts that should not be hallucinated. This architecture is designed to store and dynamically retrieve facts using millions of memory experts.

In this work we build on information retrieval, database systems, and LLM training systems to propose Lamini-1, a model architecture eschewing transformers for knowledge retrieval and instead relying entirely on a massive mixture of memory experts (MoME). Previous work has shown how it’s possible to inject memories directly into LLMs (Meng et al., 2022). Lamini-1 allows for significantly more parallelization and can reach a new state-of-the-art in factual recall after 1 hour of training on 8 MI300X GPUs.
The authors argue that a single epoch of training is appropriate for tasks requiring generalization and creativity, where some random choice among similar tokens is acceptable. However, it is not sufficient to reach the level of precision needed for factual tasks, where getting the answer exactly right matters.
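To make that precision argument concrete, here is a small, purely illustrative calculation (not from the paper) connecting the probability a model assigns to the correct answer token with its cross-entropy loss; driving the training loss toward zero is the same as putting essentially all of the probability mass on the exact fact.

```python
import math

# Cross-entropy loss for a single target token is -log(p), where p is the
# probability the model assigns to the correct token. The probabilities
# below are made-up values, chosen only to illustrate the scale.
for p in [0.5, 0.9, 0.99, 0.999]:
    print(f"p(correct token) = {p:<6} -> per-token loss = {-math.log(p):.4f}")

# A loss around 0.69 (p = 0.5) still leaves a coin flip between two plausible
# tokens -- fine for creative text, but the model gets a precise fact wrong
# about half the time. Near-zero loss (p ~ 0.999) is what exact recall needs.
```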
In the paper, they mention that training large language models is extremely computationally intensive. For instance, training Llama 2 with 70 billion parameters required 35 days using 2,000 A100 GPUs for just a single epoch. Achieving zero training loss on key facts would require 100 epochs, increasing the computational demand by 100 times.
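A quick back-of-the-envelope calculation shows why that 100x factor is prohibitive; the numbers are the ones quoted above, the arithmetic is mine:

```python
# Rough estimate based on the quoted figures: 35 days on 2,000 A100 GPUs
# for one epoch of Llama 2 70B, and ~100 epochs to reach near-zero loss
# on key facts.
gpu_count = 2_000
days_per_epoch = 35
epochs_for_memorization = 100

gpu_days_one_epoch = gpu_count * days_per_epoch            # 70,000 GPU-days
gpu_days_memorized = gpu_days_one_epoch * epochs_for_memorization

print(f"One epoch:          {gpu_days_one_epoch:,} GPU-days")
print(f"100 epochs (naive): {gpu_days_memorized:,} GPU-days")
print(f"Wall-clock on the same cluster: ~{days_per_epoch * epochs_for_memorization / 365:.1f} years")
```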
To address these high costs, Lamini-1 combines a transformer backbone (such as Llama 2) with a massive bank of memory experts (the MoME). The transformer backbone is frozen, and the memory experts are then trained on a dataset to memorize the facts.
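The paper does not come with reference code, so the following PyTorch snippet is only my sketch of how such a layer could look: a frozen backbone produces hidden states, a cross-attention-style lookup selects a small subset of experts from a large bank, and the selected experts contribute a memory vector. The module names, sizes, and top-k routing details are assumptions made for illustration, not Lamini-1 internals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoMELayer(nn.Module):
    """Illustrative mixture-of-memory-experts layer (my interpretation, not
    the official Lamini-1 implementation)."""

    def __init__(self, d_model: int, num_experts: int = 10_000, top_k: int = 32):
        super().__init__()
        self.top_k = top_k
        # One key and one value vector per memory expert; the paper describes
        # a bank on the order of a million experts.
        self.expert_keys = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        self.expert_values = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        self.query_proj = nn.Linear(d_model, d_model)  # cross-attention query

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) coming from the frozen backbone
        query = self.query_proj(hidden)
        scores = query @ self.expert_keys.T                      # (batch, seq, num_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)    # pick k experts per token
        weights = F.softmax(top_scores, dim=-1)                  # (batch, seq, k)
        selected = self.expert_values[top_idx]                   # (batch, seq, k, d_model)
        memory = (weights.unsqueeze(-1) * selected).sum(dim=-2)  # weighted mixture of experts
        return hidden + memory                                   # residual connection
```

In a full model this layer would sit on top of (or be interleaved with) the frozen backbone's blocks; it is kept standalone here to keep the sketch short.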

The massive MoME is designed to cut down on the amount of computation required to memorize facts. This is accomplished by the following training algorithm:
1. For a given question, select a subset of experts, e.g. 32 out of the array of one million.
2. Freeze the weights of the backbone network and the cross-attention used to select the experts.
3. Take gradient descent steps until the loss is reduced sufficiently to memorize the fact.
The computation cost of memorizing each fact now scales with the number of training examples, not with the total number of parameters in the network.
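Again as a hedged sketch rather than the paper's actual code, one memorization step for a single fact could look like the function below, reusing the hypothetical MoMELayer-style model from above. The optimizer, learning rate, stopping threshold, and step limit are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def memorize_fact(model: nn.Module,
                  expert_bank: nn.Parameter,
                  input_ids: torch.Tensor,
                  target_ids: torch.Tensor,
                  loss_threshold: float = 1e-3,
                  max_steps: int = 500) -> float:
    """Drive the loss on one fact toward zero by updating only the memory
    experts (illustrative; hyperparameters are assumptions, not the paper's)."""
    # Step 2: freeze everything (backbone + expert-selection cross-attention),
    # then re-enable training only for the expert value bank.
    for p in model.parameters():
        p.requires_grad_(False)
    expert_bank.requires_grad_(True)

    optimizer = torch.optim.Adam([expert_bank], lr=1e-3)

    for _ in range(max_steps):
        logits = model(input_ids)                    # (batch, seq, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
        if loss.item() < loss_threshold:             # step 3: stop once memorized
            break
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Because the forward pass indexes only the top-k selected experts (step 1),
    # gradients touch just those rows of the expert bank -- which is why the
    # cost scales with the number of facts rather than the full parameter count.
    return loss.item()
```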
A plot in the paper shows a hypothetical case comparing the training loss of memory-tuned models against an underfit model, an overfit model, and a model trained to minimize generalization error.
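That figure is not reproduced here, so the snippet below only draws schematic curves of the kind described; the shapes are invented purely for illustration and are not results from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

# Schematic curves only -- invented shapes meant to convey the qualitative
# comparison described in the paper, not its actual results.
epochs = np.linspace(0, 100, 200)

underfit     = 1.7 + 0.3 * np.exp(-epochs / 30)          # plateaus well above zero
generalizer  = 2.0 * np.exp(-epochs / 10) + 0.6          # stops near the best generalization loss
overfit      = 2.0 * np.exp(-epochs / 10) + 0.05         # keeps pushing training loss down
memory_tuned = 2.0 * np.exp(-epochs / 5) + 0.001         # driven to ~0 on the key facts

plt.figure(figsize=(7, 4))
for curve, label in [(underfit, "underfit"),
                     (generalizer, "minimize generalization error"),
                     (overfit, "overfit"),
                     (memory_tuned, "memory-tuned (key facts)")]:
    plt.plot(epochs, curve, label=label)
plt.yscale("log")
plt.xlabel("training epochs")
plt.ylabel("training loss on key facts (schematic)")
plt.legend()
plt.tight_layout()
plt.show()
```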

A very interesting paper overall. Let's see what follow-up work builds on it.