How the Brain and AI Overcome Forgetting

Bridging the gap: Brain-inspired solutions to address catastrophic forgetting in artificial neural networks

Memory is essential for our existence and identity. The goal of learning is to store memories in the brain so they can be retrieved rapidly or intuitively when needed. This applies to everything we do in everyday life, including skills, habits, reasoning, social interactions, and decision-making. Although we have been learning constantly since birth, many of us wish we could learn faster and remember more. In reality, we frequently complain about forgetting, yet sometimes we cannot shake off the bad memories we want to forget. As we grow older, dementia becomes one of the top concerns that may affect our lives and those around us.

While artificial intelligence (AI) has taken inspiration from human brains and made remarkable advances in deep learning, one challenge artificial neural networks (ANNs) have experienced is catastrophic forgetting, where the network encounters a sudden and drastic performance drop in previously mastered tasks when learning new information. Catastrophic forgetting typically happens when an ANN is trained to learn multiple tasks sequentially. However, humans can learn large numbers of tasks throughout their lifetime without forgetting those previously learned.

Before delving into AI and catastrophic forgetting, we will first examine how the human brain acquires and holds memories by focusing on three key aspects:

Declarative and non-declarative memory with temporal gradients: How these two memory types persist over time provides insight into how the human brain organizes information.
Highly regulated and selective nature of human memory: Human memory exhibits a high degree of regulation and selectivity, influencing the retention and retrieval of information.
Memory consolidation through replays during sleep: Replays during sleep shed light on the processes that strengthen and solidify memories.

Given the complexity of the brain mechanisms that allow remembering and learning to proceed in parallel without interference from each other, we will then be better placed to understand the potential gaps between the brain and ANN architectures and to explore how brain-inspired solutions mitigate catastrophic forgetting in deep learning systems.

(Note: We also need to be familiar with the concept of neuronal synapses as the building blocks of learning and memory in the brain, for which I add a quick recap below. For those who are unfamiliar with the biological background of neural networks, I recommend you read one of my previous articles, “From Biological Learning to Artificial Neural Network: What’s Next?” before moving on.)

Biological Building Blocks of Learning and Memory

The fundamental learning unit in the brain is the connection between neurons known as synapses, which transmit electrical impulses from the axons of pre-synaptic neurons to the dendrites of the post-synaptic neurons. During the learning process, the strength of involved synapses is enhanced, resulting in long-lasting (e.g., hours or longer) and more robust responses in the post-synaptic cell when exposed to the same stimuli from the input axon. This phenomenon is called long-term potentiation (LTP).

LTP was initially discovered in the hippocampus region of the animal brain. Later research showed that LTP, with its strengthening of synapses, is a universal learning mechanism in various animal species, including humans. Its lasting effect is the basis of memory and is realized through the underlying cellular and molecular changes surrounding the synapse. This biological mechanism inspired the breakthroughs of ANNs for deep learning, which comprise many layers of artificial neurons and interconnected weights serving as artificial synapses. These connections are adjusted through learning from extensive training datasets, with effects similar to LTP.
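To make the analogy concrete, here is a minimal sketch of a Hebbian-style weight update, a common textbook abstraction of LTP rather than a biophysical model. The function name `hebbian_step`, the learning rate, and all numeric values are illustrative assumptions and are not taken from any study mentioned in this article.

```python
import numpy as np

def hebbian_step(weights, pre, lr=0.1):
    """One Hebbian-style update (illustrative assumption): synapses that are
    active while the post-synaptic cell responds get stronger."""
    post = float(weights @ pre)          # post-synaptic response to the stimulus
    return weights + lr * post * pre     # delta_w = lr * post * pre

stimulus = np.array([1.0, 0.0, 1.0])     # hypothetical pre-synaptic activity
weights = np.array([0.2, 0.2, 0.2])      # hypothetical initial synaptic strengths

response_before = float(weights @ stimulus)
for _ in range(5):                       # repeated exposure to the same stimulus
    weights = hebbian_step(weights, stimulus)
response_after = float(weights @ stimulus)

print(response_before, response_after)
```

After the repeated exposures, the response to the same stimulus is larger than before, while the synapse that received no input (the middle one) keeps its original strength.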

Human Memory Formation and Consolidation

Declarative and non-declarative memory with temporal gradients

Studying how we forget has provided remarkable insights into how we remember. In 1957, neuroscientist Brenda Milner first reported the profound but selective memory loss of the patient Henry Molaison, known as H.M., after surgery to remove his bilateral medial temporal lobes. The surgery succeeded in relieving the severe epilepsy that had disrupted H.M.’s everyday life. But he lost memories acquired before the surgery, spanning from minutes to months, with some extending back a few years. He also lost the ability to form new memories. Those who worked with him over the following decades had to introduce themselves every single time they visited him.

Intriguingly, H.M.’s childhood memories remained intact. In addition, his lost memory was limited to facts, events, and knowledge he used to access consciously. The surgery did not affect H.M.’s motor learning and memory (e.g., riding a bicycle); his other perceptual and cognitive functions were also normal.

The H.M. case suggests the memory process is a distinct function in the brain. It triggered decades of intensive memory research in neuroscience. As of today, many unknowns remain, particularly regarding the interactions between different brain regions and the nature of long-term storage.

At a high level, memory is categorized into three types: working memory, declarative memory, and non-declarative memory.

Working Memory is a brain function that can hold information temporarily with limited capacity. It is critical in reasoning and mentally manipulating information but distinct from the information storage process. The most famous example is that you can hold a new phone number in your mind for a limited time before your attention drifts away. H.M.’s working memory was also intact, suggesting it involves different brain regions than the medial temporal lobes.

Declarative memory is also called explicit memory because it is available for conscious recall (e.g., your last birthday celebration, the name of the US capital). It is what we usually mean when we talk about memory. As shown in the H.M. case, declarative memory has a time gradient based on how long it persists, depicted as short-term (seconds to hours), long-term (hours to months), and long-lasting (months to lifetime). The hippocampus region in the medial temporal lobes is responsible for the formation of short-term and long-term memories. Long-lasting memory is presumably stored in a distributed manner in the neocortex, which serves as the eventual storage site and is independent of the hippocampus.

Non-declarative memory is expressed through behaviors beyond conscious control (e.g., playing an instrument or riding a bicycle); therefore, it is also called procedural memory or implicit memory. It includes our habit learning and automatic reflexes, such as drooling at the sight of chocolate or freezing up when seeing a bear.

Both declarative and non-declarative memories are essential to human lives and operate in parallel to support our decisions and behaviors. As revealed in the H.M. case, non-declarative memory is not stored in the hippocampus but in other regions, including the cerebellum and motor cortex. However, it has the same building blocks at the neural circuit level and a similar short-term to long-term temporal gradient as declarative memory. Hence, the subsequent discussion on memory will pertain to both types unless a noteworthy distinction requires mentioning.

The main brain regions involved in memory formation and consolidation. Image source: Wikipedia

Highly regulated and selective nature of human memory

Even though humans are bombarded with sensory information when walking into an environment, we only store a tiny proportion of it in our memory. Our attention guides the acquisition process and enables the brain to focus on the signals needed at the moment, while ignoring other irrelevant information. The attention process also couples with context switching and memory retrievals. We can get distracted by a surprise and redirect our attention away from the original object. During the attention period, we actively or automatically search for previous memories with similar information and decide if the information is entirely novel or can be associated with an earlier memory. The prefrontal cortex in the brain serves as a conscious headquarters, directing and coordinating activities, and guiding other brain regions, such as the hippocampus or cerebellum, in acquiring information into short-term memory.

Furthermore, emotions influence our attention and shape our memory. Stimuli that evoke emotions attract most of our attention, and humans need to be motivated or rewarded to acquire information. We typically remember emotionally arousing events better than emotionally neutral events. When we recall a past event or experience, our feelings at the moment comprise an essential part of the memory.

Emotions enhance memory by involving a brain region called the amygdala (see the picture above) and several neural molecules called modulators that regulate neural plasticity in the brain. Typically, dopamine is released when we are motivated, and norepinephrine (NE) is released in response to sudden surprises or acute stresses (e.g., plunging into cold water). These neural modulators substantially increase neural plasticity by making synapses more receptive to strengthening during a single learning experience, and they contribute to the enhanced formation and long-term strength of memories.

Memory consolidation through replays during sleep

Daniel Kahneman noted in his book “Thinking, Fast and Slow” that our memory is not a sum or integral of our experience. For example, when we go on a vacation or have a treatment routine in a hospital, we typically remember the most emotional moment and ending experience. The rest of the day-to-day details most likely have receded to the background or are nowhere in our memory. Our memory of an experience has been compressed and consolidated, like a novel with a climax and heartfelt ending, while the length of the period, which could be days, months, or years, does not seem to matter.

For short-term memory to become long-term and eventually long-lasting memory independent of the hippocampus, a consolidation process is required. This memory consolidation process occurs behind the scenes after the learning behavior is complete, and it can take days or months in humans. It involves the reciprocal connections between the hippocampus and neocortex, leading to re-organizations of the internal representation. The time-gradient nature of memory consolidation, as shown in the H.M. case, has also been confirmed in animals, with the period length varying based on the species. The evidence indicates that disruptions to the hippocampus (e.g., electrical shock or drug intervention) during this critical period could cause retrograde amnesia.

Mounting evidence also suggests that memory consolidation is achieved through memory replays, mostly during sleep, when no external information disturbs the process. Specifically, deep non-rapid eye movement (NREM) sleep, also known as slow-wave sleep, is essential for memory replays across multiple brain areas, including the hippocampus and neocortex, for the consolidation of both declarative and procedural memories.

From a neural circuit perspective, two mechanisms, namely memory replay and synaptic scaling, have been proposed to operate during sleep. Memory replay strengthens relevant synaptic connections in the cortex by reactivating internally sampled network patterns. Conversely, synaptic scaling weakens the other, less relevant synaptic weights to enhance the signal-to-noise ratio and the brain’s capacity for new learning. These mechanisms explain why the most emotional moments and ending features survive in our memories: they are the most salient features extracted and strengthened during reactivations. In contrast, weaker and vaguer memories likely fade or are lost because of the scaling effect.
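The interplay of the two proposed mechanisms can be illustrated with a toy calculation. This is only a sketch under strong assumptions: the additive boost for replayed synapses, the single multiplicative scaling factor that restores the total weight, and all the numbers are hypothetical choices made for illustration, not measurements or a published model.

```python
import numpy as np

def replay_then_scale(weights, replayed, boost=0.5):
    """Strengthen replayed synapses, then rescale all weights so their
    total returns to its starting value (both steps are toy assumptions)."""
    total_before = weights.sum()
    strengthened = weights + boost * replayed      # replay: boost reactivated synapses
    scale = total_before / strengthened.sum()      # scaling: one multiplicative factor
    return strengthened * scale

weights = np.array([0.5, 0.5, 0.5, 0.5])    # hypothetical synaptic strengths
replayed = np.array([1.0, 1.0, 0.0, 0.0])   # 1 marks synapses reactivated during replay

after_sleep = replay_then_scale(weights, replayed)
print(after_sleep)
```

In this toy case, the two replayed weights end up above their starting value of 0.5 and the other two end up below it, while the total weight is unchanged. The contrast between salient and non-salient connections grows without the overall synaptic strength growing.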

Brain-Inspired Solutions to Address ANN Catastrophic Forgetting

In recent decades, AI has made significant strides in implementing large deep ANNs capable of learning a wide range of tasks, given sufficient training data and computing power. However, ANNs tend to severely forget previously learned tasks after learning new information. The phenomenon is called catastrophic forgetting or catastrophic interference, where the new learning disrupts or interferes with the retention of prior knowledge. In humans, in contrast, the extent of interference between different learning experiences is comparatively minor or nonexistent.

Catastrophic forgetting has nothing to do with the network’s capacity, because the same network can learn many more tasks concurrently as long as the tasks are interleaved in the same training data. It occurs when the training data for a new task is presented after the previous tasks have been learned. Looking deeper into the network, the forgetting results from the loss of earlier representations in the hidden layers: the weights tuned for learned tasks are changed during subsequent learning with new objectives.
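The difference between sequential and interleaved training can be reproduced even in a two-weight linear model. The sketch below is a deliberately tiny toy, not a deep network: the two "tasks", their single training examples, and the helper names `train` and `loss` are all assumptions made for illustration. Both tasks can be solved at once by the weights (0, 2), so capacity is not the issue.

```python
import numpy as np

def train(w, xs, ys, steps=2000, lr=0.1):
    """Plain gradient descent on 0.5 * mean squared error for y = w . x."""
    w = w.copy()
    for _ in range(steps):
        err = xs @ w - ys                  # one prediction error per example
        w -= lr * xs.T @ err / len(ys)
    return w

def loss(w, xs, ys):
    return 0.5 * float(np.mean((xs @ w - ys) ** 2))

# Two toy tasks with one example each (hypothetical values).
task_a = (np.array([[1.0, 1.0]]), np.array([2.0]))
task_b = (np.array([[1.0, 0.0]]), np.array([0.0]))

# Sequential training: learn task A, then task B.
w_seq = train(np.zeros(2), *task_a)
loss_a_before = loss(w_seq, *task_a)       # task A is mastered at this point
w_seq = train(w_seq, *task_b)
loss_a_after = loss(w_seq, *task_a)        # task A error after learning task B

# Interleaved training: both tasks in the same training data.
xs = np.vstack([task_a[0], task_b[0]])
ys = np.concatenate([task_a[1], task_b[1]])
w_mix = train(np.zeros(2), xs, ys)

print(loss_a_before, loss_a_after, loss(w_mix, *task_a), loss(w_mix, *task_b))
```

With sequential training, learning task B overwrites the shared first weight and the error on task A jumps back up. With interleaved training, the same model settles on weights that fit both tasks.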

Since the 1990s, researchers have been looking for algorithms to overcome catastrophic forgetting. Notably, many have pivoted to brain-inspired solutions, which have led to promising outcomes. The approaches can be categorized into two main groups based on their achieved effects:

Network Reactivations: Replay internal representation samples by mimicking memory replays in human sleep to facilitate the retention of previously learned tasks.
Reduction of Connection Weight Overlapping: Apply local regulations in the network to reduce interference across tasks.

Now, let’s delve into specific studies exemplifying each of these categories.

Network Reactivations

The simplest method for replay is to re-apply the original training data. However, this is not a scalable solution, nor is it what happens in the brain. Instead, as inspired by human brains, the replays should use the internal representation of learning (i.e., from the hippocampus) without disturbances or interference from raw sensory input information (as during sleep).

This study published in Nature develops a separate generative model that learns past information and outputs its learned representations to the primary network that learns every task sequentially. The model did not prevent catastrophic forgetting initially. The authors then added four brain-inspired modifications listed below, which effectively stopped the forgetting.

Merge the generative model into the primary model as top layers so that it participates in both the forward and backward passes. This mimics the reciprocal connectivity between the hippocampus and neocortex in the brain.
The generative model only sends a sample of the previously learned examples for replay.
The replay applies to the hidden layers of the deep neural network, not the input layers that accept the initial training data.
Add a context-dependent gating (XdG) mechanism to reduce weight overlapping across tasks by inhibiting activities of non-important nodes during replays. (Note: This XdG mechanism appears to be an essential complementary approach, which will be mentioned again in the next section.)

With these modifications, the brain-inspired replays preserve over 90% of the previously learned tasks’ accuracy, while the deep learning model learns 100 handwriting classification tasks sequentially using MNIST datasets and protocol. In contrast, without the replays, the model’s performance degrades exponentially to zero (meaning forgetting completely) after 15 tasks.

Furthermore, the replay is highly efficient and does not require extra data storage. Simply replaying one sample prevents catastrophic forgetting. As a process separate from learning, the generative model strengthens memory and prevents forgetting by sending far less, but still good-enough, information.

However, in a more complex scenario with image classification tasks, the replay only moderately mitigates catastrophic forgetting but does not prevent it. The result suggests that the replay itself is not enough in this type of scenario but needs to be applied in combination with the reduction of connection weight overlapping we are coming to next.
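The core idea of replaying generated internal representations, rather than stored raw data, can be sketched in a toy setting. Everything below is an assumption made for illustration: the three "hidden units", the linear readout, and above all the diagonal Gaussian used as the generator, which is far simpler than the learned generative model the study describes. Labeling the replayed samples with the old readout’s own predictions is also a choice of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(w, feats, targets, steps=5000, lr=0.1):
    """Gradient descent on 0.5 * mean squared error for a linear readout."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * feats.T @ (feats @ w - targets) / len(targets)
    return w

def loss(w, feats, targets):
    return 0.5 * float(np.mean((feats @ w - targets) ** 2))

# Hypothetical hidden-layer activations (3 units) for two tasks.
n = 200
a = rng.normal(1.0, 0.5, size=(n, 2))
feats_a = np.column_stack([a, np.zeros(n)])       # task A drives units 0 and 1
targets_a = feats_a @ np.array([1.0, 1.0, 0.0])
c = rng.normal(1.0, 0.5, size=n)
feats_b = np.column_stack([np.zeros(n), c, c])    # task B drives units 1 and 2
targets_b = 3.0 * c

w_old = train(np.zeros(3), feats_a, targets_a)    # learn task A first

# Baseline: learn task B with no replay.
w_plain = train(w_old, feats_b, targets_b)

# Stand-in generator: a diagonal Gaussian fitted to task A activations.
# Only its mean and standard deviation are kept, not the task A data.
mu, sigma = feats_a.mean(axis=0), feats_a.std(axis=0)
replay_feats = rng.normal(mu, sigma, size=(n, 3))  # generated hidden activity
replay_targets = replay_feats @ w_old              # labels from the old readout

# Learn task B while interleaving the generated replay samples.
w_replay = train(w_old,
                 np.vstack([feats_b, replay_feats]),
                 np.concatenate([targets_b, replay_targets]))

print(loss(w_plain, feats_a, targets_a), loss(w_replay, feats_a, targets_a))
```

Without replay, learning task B changes the weight of the shared unit and the error on task A rises sharply. With replay, the readout reaches a solution that fits both tasks, even though no task A example was stored.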

Reduction of Connection Weight Overlapping

As mentioned above, in human and animal brains, LTP triggers a series of cellular and molecular changes in the strengthened synapses, which persist over time as the neural basis of long-lasting memory. Disruptions of these structural changes would erase the corresponding memory. From a neural network perspective, it means a proportion of weights strengthened in one task should stabilize to become less plastic, which underlies the endurance of long-term knowledge without being interfered with by new memory acquisitions.

By taking this inspiration from the brain, the researchers from DeepMind designed an algorithm, called elastic weight consolidation (EWC), to protect the ANN parameter weights that contribute to the previously learned tasks. Specifically, they use Bayesian inference to derive the posterior probabilities of those parameters (note: for background knowledge on Bayesian inference, please refer to this article). A constraint is added to prevent those parameters important to the previously learned task from being modified by the newer tasks. The model stopped the catastrophic forgetting in learning ten sequential tasks using the MNIST dataset.

This study also aligns with the principles of Bayesian learning, as previously discussed in my article on Bayesian inference. Notably, the Bayesian posterior probability of a specific connection weight represents the uncertainty about that weight. The uncertainty drives both the future learning rate and its own plasticity. Lower uncertainty leads to more protection of the synaptic weight, reducing the likelihood of modification in a new environment or task. Conversely, higher uncertainty makes it more susceptible to modification through new experiences.
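A minimal sketch of an EWC-style penalty follows, assuming a toy linear model with one training example per task. The penalty adds a quadratic cost, weighted per parameter by an importance estimate, for moving away from the weights learned on the earlier task. Using the squared input as the diagonal importance (Fisher) estimate is a simplification that holds for this linear, unit-noise toy model; the task values, `lam`, and the helper names are hypothetical.

```python
import numpy as np

def train(w, x, y, anchor=None, fisher=None, lam=0.0, steps=2000, lr=0.1):
    """Gradient descent on 0.5 * squared error, plus an optional EWC-style
    penalty (lam / 2) * sum(fisher * (w - anchor) ** 2)."""
    w = w.copy()
    for _ in range(steps):
        grad = (w @ x - y) * x                          # gradient of the task loss
        if anchor is not None:
            grad = grad + lam * fisher * (w - anchor)   # pull important weights back
        w -= lr * grad
    return w

def loss(w, x, y):
    return 0.5 * float((w @ x - y) ** 2)

# Toy tasks with one example each (hypothetical values).
x_a, y_a = np.array([1.0, 1.0, 0.0]), 2.0    # task A relies on weights 0 and 1
x_b, y_b = np.array([0.0, 1.0, 1.0]), 3.0    # task B relies on weights 1 and 2

w_a = train(np.zeros(3), x_a, y_a)           # learn task A first
fisher = x_a ** 2                            # diagonal importance estimate for task A

w_plain = train(w_a, x_b, y_b)                                     # no protection
w_ewc = train(w_a, x_b, y_b, anchor=w_a, fisher=fisher, lam=1.0)   # with penalty

print(loss(w_plain, x_a, y_a), loss(w_ewc, x_a, y_a), loss(w_ewc, x_b, y_b))
```

The importance estimate is zero for the third weight, meaning task A left it completely uncertain, so it stays fully plastic and absorbs task B. The two weights that matter for task A are held in place, and the error on task A stays near zero instead of rising as it does without the penalty.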

At the network level, this EWC mechanism does not depend on when the memory was acquired but on the certainty of that memory. It implies the network can maintain its expertise or knowledge over time, provided the weights remain intact. This aligns with the concept of our long-lasting memory stored in the human neocortex. For example, in Alzheimer’s disease, the gradual deterioration of those enduring synapses most likely drives the devastating loss of lifelong memories.

One limitation of this study is that only ten sequential tasks were tested. In a study by scientists from the University of Chicago, the EWC stabilization effect was examined across 100 sequential tasks. Although it achieves similar effectiveness within the initial ten tasks, the performance continuously declines as the number of tasks increases.

Interestingly, the EWC effect is boosted by the context-dependent gating (XdG) mechanism that explicitly suppresses task-irrelevant connection activities (Note: this method also complements the generative model replays in the study mentioned above). XdG draws inspiration from neuroscience findings, where switching between tasks activates a non-overlapping set of neuronal dendritic branches specific to the task and inhibits the rest used for other tasks. The combination of EWC and XdG effectively prevents forgetting in the 100-task scenario by minimizing overlapping connection weights, hence, interference from the new tasks.
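The gating idea can be sketched as a fixed binary mask per task applied to the hidden layer. This is an illustrative sketch only: the layer sizes, the fraction of units kept active (0.2), and the random choice of units are assumptions, and randomly drawn masks overlap partially rather than being strictly disjoint.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task_masks(n_tasks, n_hidden, keep_frac, rng):
    """Draw one fixed binary mask per task; 1 marks a hidden unit that stays active."""
    n_keep = int(round(keep_frac * n_hidden))
    masks = np.zeros((n_tasks, n_hidden))
    for task in range(n_tasks):
        active = rng.choice(n_hidden, size=n_keep, replace=False)
        masks[task, active] = 1.0
    return masks

def gated_forward(x, w_in, w_out, mask):
    """Forward pass in which the task mask silences the gated hidden units.
    Silenced units output zero, so they also receive no gradient for this task."""
    hidden = np.maximum(0.0, x @ w_in) * mask
    return hidden @ w_out, hidden

n_in, n_hidden, n_out, n_tasks = 8, 100, 3, 5       # hypothetical sizes
masks = make_task_masks(n_tasks, n_hidden, keep_frac=0.2, rng=rng)
w_in = rng.normal(size=(n_in, n_hidden))
w_out = rng.normal(size=(n_hidden, n_out))
x = rng.normal(size=n_in)

output, hidden = gated_forward(x, w_in, w_out, masks[0])

# Fraction of each task's active units that another task also uses.
overlap = masks @ masks.T / masks.sum(axis=1, keepdims=True)
print(overlap.round(2))
```

Because each task touches only its own subset of hidden units, weight updates for one task leave most of the connections used by other tasks alone, which is the interference-reducing effect described above.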

Other studies use different algorithms to modulate the synapse weights locally in the hidden layers of ANNs to achieve a similar effect. For example, a recent study published in Science shows that an ANN with the simulated neuro-modulator (e.g., dopamine and NE) effects makes the strengthened weights more specific in a non-linear fashion and, therefore, learns sequential tasks with no catastrophic forgetting.

In summary, the accumulating evidence suggests that strengthening a specific subset of connection weights for a learned task and minimizing their overlap across multiple tasks can effectively mitigate catastrophic forgetting in artificial neural networks. The results align with observations in the biological brain, where neural plasticity is regulated with greater robustness and precision. We expect to see future efforts to scale up these mechanisms in larger and deeper neural networks and to apply them to more complex real-world use cases.

Conclusions

In recent decades, AI has made phenomenal progress in matching or surpassing human performance after adopting human brain-inspired deep neural networks. These networks are primarily implemented for specific tasks and problems, requiring enormous training data and computation resources. In contrast, human brains can learn quickly and incrementally with significantly less data and remarkable energy efficiency.

Progress in addressing catastrophic forgetting in ANNs points toward a way of narrowing this gap. Recent studies reveal that a single brain-inspired method is insufficient to overcome catastrophic forgetting. Instead, combining multiple processes is essential to prevent forgetting, particularly when the number of tasks or complexity increases. Interestingly, those brain-inspired solutions consistently exhibit energy efficiency. Further research in this area will continue to shed light on the path for AI to achieve more efficient and generalized learning in the foreseeable future.

Conversely, a crucial insight from investigating human memory processes is that our brains not only refine our brain cells’ plasticity during learning but also actively consolidate and maintain what we have learned throughout our lifetime. Neuroscientists currently possess a limited understanding of these mechanisms. Deep learning models are valuable tools to supplement their research, facilitating theoretical frameworks, systematic tests, and network-level experiments. The ongoing convergence of these two fields promises a deeper understanding of the brain’s inner workings in learning and memory, which are foundational to human cognition and intelligence.