How We Can Learn from the Brain to Learn How the Brain Learns
Lessons from the Oldest Data Scientist on the Planet
Data science is the art of looking at data and extracting useful knowledge from it. Data is everywhere. Data means images, audio, texts, stock market trends. Data is the mode in which the world appears to us as we are inquiring into it. Having access to useful knowledge provides an advantage over not having it, be it for the survival of the species or for the survival of a company.
And so the data scientist’s quest is as old as life on Earth itself: all life forms have, in one way or another, been extracting knowledge from the stream of data coming in from their environment to serve their goal of survival, some for hundreds of millions or even billions of years.
Brains have taken the lead in doing this job in the most sophisticated way.
Scientific Inquiry
The quest of science is likewise to extract knowledge from the world, albeit in a more painstakingly deliberate process, which roughly splits into two elements: building models of the world and comparing those models to data.
In more abstract terms, these two steps are similar to the two parts of a generative model like an autoencoder (I wrote more about the core ideas behind generative models here).
1. Encoder: Data → Model
In the first stage, which can be called the encoder stage, we extract some kind of representation/model from data, which we hope in some way reflects some real (causal/probabilistic, etc.) structure behind the data.
2. Decoder: Model → Predictions (new data)
The second stage decodes the model by making predictions about the world with it, which are compared in experiments to the observations.
Model Inversion
But there is another step we need to take. How do we set up and change our model if we only have little information or our predictions are off (informational uncertainty)? And what happens if our model stops being good enough because the world in which the model was set up changes or contains some kind of inherent uncertainty (environmental uncertainty)? The laws of physics might always remain the same, but what if we instead model something as fickle and uncertain as the actions of another human being?
We, therefore, need another stage, in which predictions about the world are compared to the real world and the model is adjusted to accommodate for prediction errors.
3. Model inversion: Predicted Data vs. Incoming Data → Improved Model
How do you use data to improve your model most effectively? How do you simultaneously turn the many screws of a complex model without screwing it up?
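To make the three stages concrete, here is a deliberately tiny sketch of my own: a "model" that consists of a single number, the estimated average of a noisy signal. The function name, the learning rate, and the toy data are all assumptions made for illustration, and real models obviously have many more screws to turn.

```python
import random


def run_inference_loop(observations, learning_rate=0.1, initial_estimate=0.0):
    """Predict, compare, update: a delta-rule learner for one scalar."""
    estimate = initial_estimate
    errors = []
    for observed in observations:
        prediction = estimate              # decoder: model -> predicted data
        error = observed - prediction      # compare prediction with incoming data
        estimate += learning_rate * error  # inversion: nudge the model toward the data
        errors.append(error)
    return estimate, errors


# Toy data (assumption): noisy observations scattered around a hidden value of 5.0
rng = random.Random(0)
data = [5.0 + rng.gauss(0.0, 1.0) for _ in range(500)]
final_estimate, errors = run_inference_loop(data)
print(f"final estimate: {final_estimate:.2f}")
```

The learning rate here is fixed by hand. Much of what follows is about letting the model itself decide how strongly each prediction error should change it.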
As I mentioned in the beginning, brains are really good at extracting relevant knowledge from data. When it counts, they are brilliant intuitive scientists (even though they have hundreds of biases, as I go through in my article on the connections between AI and cognitive biases). They build models of the world constantly, make predictions based on these models, and invert and improve them if their predictions are off. It is why evolution put them in our heads in the first place.
Brains are highly efficient learners across a wide range of tasks. They tend to generalize much better and learn more flexibly than our current machine learning algorithms, which also means quickly adjusting to changing environments.
They make predictions about incredibly complex processes, such as “What is this person that I met a couple of minutes ago going to do next?”, which implies that our brains build a model of every person we meet, integrate this model into pre-existing models of what a person is, further integrate several data modalities (how does this person look/speak/smell/move?) to fine-tune the model, and then use this new approximate model of the person to make predictions in real-time about his or her behavior or to classify them quickly as friend or foe.
If the person acts in conflict with our prediction, our brain updates its model seamlessly, without us noticing it most of the time. But as the current standards in the fields of machine learning and artificial intelligence show, doing these kinds of things with a computer takes longer, consumes far more resources, and needs more data.
Learn from the Brain to Learn from the Brain

Neuroscientists have been studying the brain for more than a hundred years. But the brain is a hard nut to crack. The data we gather from it is usually messy and hard to interpret. Sometimes it’s tempting to just give up. But brains face a similar situation every day. And brains can’t afford to make excuses. If our brains had just resigned and said “the world is too complicated, it’s impossible to learn anything useful from it”, we would have died out long ago.
So instead of giving up, how can we learn good models of the brain even though we don’t have tons of clean data available?
In a gloriously unvicious circle, we can look at the subject of our study itself for guidance: we can learn from the brain how the brain learns, finding inspiration on how to improve and structure our algorithms that in turn help us analyze and model brain data and behavioral data better (I will give a more concrete example of this soon).
Hierarchical Models in the Brain
The brain infers and improves its shifting models of the world efficiently.
Hierarchical models are one way of conceptualizing this. As I discussed in more detail in my article on intelligence (read here), cognition can be thought of as structured in hierarchies (this also holds for thought itself, see also my article on the Geometry of Thought).
Hierarchical models are good tools for incrementally building up more and more sophisticated representations of objects. They can capture profound and non-trivial structures and probability distributions behind data and can be used to efficiently generate predictions (for a more technical introduction and examples of hierarchical models, read the intro from Penny and Henson here).
Their potential has inspired neuroscientists to look for hierarchical models in the brain. As part of Karl Friston’s famous and controversial Bayesian Brain Hypothesis, he and his colleagues propose that hierarchical models could be implemented in the human cortex, and we might actually observe in fMRI data how they get updated in real-time behavioral experiments with human beings.
We spend our lives in a world shaped by uncertainty, and there is some evidence that the brain could be using implicit representations of probability distributions to account for this uncertainty. There are theories on how these might be built into the brain on the neuronal level (see here for a nice overview of probabilities in the brain).
Different layers (as we will see) of these probabilistic models of the world could be distributed across different brain areas and in the different layers of the prefrontal cortex, so our models of the world would likewise be physically spread out over the brain.
The Gaussian Filter Model
As a simple hierarchical model that could in some form be implemented in the brain, Mathys et al. propose the Hierarchical Gaussian Filter.
Perception always contains a level of uncertainty about the hidden states of the environment. Is there a way to straightforwardly model this uncertainty?
The aim of the Gaussian Filter is to learn about the probability structure of a hidden variable x that can change over time. This hidden variable can represent any kind of data you could think of, be it how a stock is likely to change or the decisions of your best friend.
The model represents the agent’s beliefs about how this variable behaves in time and provides a generative model that can make a prediction about the variable.
A Gaussian Filter Model is constructed out of several Gaussian distributions stacked on top of each other. Each Gaussian has its own mean and variance. The inverse of the variance is, quite intuitively, called precision: if the variance of the Gaussian is high, estimates generated by it won’t be very precise.
These Gaussians carry out a discretized random walk over time, which means that at each level the value at time t is drawn from a Gaussian centered on the value that was drawn at time t-1.
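In symbols, and using a shorthand of my own rather than the notation of the paper (x_i^{(t)} for the value at level i and time step t, \sigma_i^2 for the variance of the step, \pi_i for the corresponding precision), one step of such a random walk can be written as:

```latex
x_i^{(t)} \sim \mathcal{N}\!\left(x_i^{(t-1)},\ \sigma_i^2\right),
\qquad
\pi_i = \frac{1}{\sigma_i^2}
```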
To make a prediction with the model, we start at the top and draw a value from the Gaussian that sits highest in the hierarchy of N Gaussians and has fixed precision. Through some function, that value determines the precision of the Gaussian one layer below it, from which we then draw. This Gaussian, in turn, determines the precision of the Gaussian below it.
Rinse and repeat until we reach the bottom layer.
The hidden variable x that we are trying to predict is connected to the bottom layer of the hierarchy of Gaussians. In case the variable is binary (taking on Boolean values like “yes” or “no”, as in decision processes) the bottom Gaussian can be connected to a function like a unit-square sigmoid. If x is itself assumed to be Gaussian, the bottom Gaussian straightforwardly models the probability distribution of the random variable x.
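The top-down sampling described above can be sketched in a few lines of Python for the binary case with two stacked Gaussians. This is an illustration under assumptions of my own, not the reference implementation: I assume an exponential function links the upper value to the variance of the lower random walk, and the names kappa, omega and top_step_sd are hypothetical parameters chosen for this sketch.

```python
import math
import random


def sigmoid(x):
    """Squash a real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_binary_hierarchy(n_steps, top_step_sd=0.1, kappa=1.0, omega=-2.0, seed=0):
    """Draw "yes"/"no" outcomes from two stacked Gaussian random walks."""
    rng = random.Random(seed)
    volatility = 0.0  # upper Gaussian: random walk with a fixed step size
    tendency = 0.0    # lower Gaussian: its step size is set by the level above
    outcomes, tendencies = [], []
    for _ in range(n_steps):
        # Top level: fixed precision, so simply take one random-walk step
        volatility = rng.gauss(volatility, top_step_sd)
        # Assumed coupling: the upper value sets the variance of the lower step
        step_variance = math.exp(kappa * volatility + omega)
        tendency = rng.gauss(tendency, math.sqrt(step_variance))
        # Bottom: the sigmoid turns the tendency into a probability of "yes"
        p_yes = sigmoid(tendency)
        outcomes.append(1 if rng.random() < p_yes else 0)
        tendencies.append(tendency)
    return outcomes, tendencies


outcomes, tendencies = sample_binary_hierarchy(200)
print(outcomes[:20])
```

When the upper walk drifts to higher values, the lower walk takes bigger steps, so the tendency towards "yes" or "no" becomes more erratic.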
This might all seem a bit abstract as of yet (also see the original paper for a figure), so let’s get some intuition on what is going on here.
Learning Behavior with the Gaussian Filter Model
The Gaussian Filter model is not only proposed to be implemented in some way in the brain, but can be “turned around” to analyze data from behavioral experiments, modeling how real brains learn in real life.
Let’s take the simplest case of two Gaussians stacked on top of each other, which are connected via a unit-square sigmoid function to a Boolean variable/decision.
Say we are looking to figure out if an agent is choosing “yes” or “no”, which is determined by some hidden states. These we can think of as internal processes in the agent we are observing, like his thoughts and motivations. They can of course change over time (I assume you have at one point said yes to something and deeply regretted saying yes later on after your model of the world changed).
The Gaussian at the bottom layer encodes the tendency of the agent towards “yes” or “no”. The second Gaussian at the layer above now models the volatility of the bottom Gaussian: how strongly is the tendency of the agent towards yes or no varying in time, and how confident in turn are we about our prediction that the agent will say yes or no?
If we stack additional Gaussians on top of the two existing Gaussians, they will model the volatility of the volatility, and so on and so on, allowing the model to capture more and more complex probability distributions of the hidden states in the agent.
Model Inversion in Hierarchical Models
We are not done yet. As I laid out in the beginning, the third and crucial step of every model is model inversion.
Neural networks are usually trained by backpropagation. You take the gradient of the loss function (which in a sense is nothing other than a prediction error) and adjust the parameters of the network in order to make the loss smaller.
The model inversion in the Gaussian Filter Model is somewhat similar in structure. After a prediction is made, its layers are updated by minimizing a variational free energy (an upper bound on the surprise, in the language of the Bayesian Brain Hypothesis) by propagating precision-weighted prediction errors upwards through the layers of the model.
Thanks to a mean-field approximation (assuming the distributions stay Gaussian and are independently updated), this scheme is relatively simple and computationally efficient, and can be carried out trial-wise and in real time.
The parameters are adjusted based on the respective prediction error. Starting from the bottom follows the same logic as in backpropagation, where you also start at the layer of the network connected to the output.
The means and precisions of the Gaussians are updated based on the size of the prediction error. Think of it this way: if the prediction was really good (matched the observation almost perfectly), the error that is propagated is small and will become smaller and smaller as you move up the hierarchy.
If the prediction was bad but the precision of the guess was really small to begin with (which means the model was highly uncertain about its prediction on this level), the model also doesn’t get adjusted too strongly, because it already assumed its prediction would be uncertain and probably off.
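The following sketch illustrates the idea of a precision-weighted update in its most generic form, the textbook combination of two Gaussian sources of information. It is not the exact set of update equations of the Hierarchical Gaussian Filter, and the function and argument names are hypothetical, chosen only for this example. Here error_precision stands for how much the model trusted the prediction that produced the error.

```python
def precision_weighted_update(belief_mean, belief_precision,
                              prediction_error, error_precision):
    """Shift a Gaussian belief by a prediction error, scaled by relative precision."""
    # Precisions add up: the updated belief is more certain than before
    posterior_precision = belief_precision + error_precision
    # The weight lies between 0 and 1 and grows with the precision of the error
    weight = error_precision / posterior_precision
    posterior_mean = belief_mean + weight * prediction_error
    return posterior_mean, posterior_precision


# Good prediction: the small error barely moves the belief
small_shift, _ = precision_weighted_update(0.0, 1.0, 0.1, 1.0)

# Bad prediction that the model trusted: the belief moves a lot
big_shift, _ = precision_weighted_update(0.0, 1.0, 4.0, 1.0)

# Equally bad prediction, but made with very low precision: almost no change
damped_shift, _ = precision_weighted_update(0.0, 1.0, 4.0, 0.01)

print(small_shift, big_shift, damped_shift)
```

The three calls correspond to the cases discussed above: a small error, a large error from a confident prediction, and a large error from a prediction the model never trusted much in the first place.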
Learning to Learn how the Brain Learns
Does this model really show us how the brain learns? The answer is still uncertain (the question is how our brains model this uncertainty), and the fact that we can successfully use the model to learn behavior does not show that the brain really implements it this way.
Nevertheless, Iglesias et al. claim to have found evidence from neuroimaging studies that it could be doing just that, observing prediction errors being propagated into different brain regions/hierarchies, depending on the size of the prediction errors, and in turn linking them to different neurotransmitters, such as dopamine, involved in reward prediction.
Friston further proposes, based on ideas from Mumford, that this model can be linked to neuroanatomy: predictions might be passed by deep pyramidal cells, while prediction errors are encoded, for example, by superficial pyramidal cells.
But there remain many open questions, such as how the Gaussians of the model would be realized anatomically, how their means and precisions would be updated based on signals from the pyramidal cells, how the prediction error could be computed, etc.
By predicting behavior with models inspired by the brain, we can learn how brains learn about the world.
But that is not all. In a more general sense, hierarchical models are useful tools in many deep learning and data science applications because of their power in structuring inference networks (for example, for amortized inference of dynamical systems, in natural language processing, or for learning better approximate posteriors in variational autoencoders), and they can help make these networks more interpretable. As our own perception of the world is hierarchical, having hierarchical models could make it easier to connect them to our intuitions and everyday language.
As I also explained in my article on Recurrent Neural Networks, building more structured models of the brain is a crucial step in understanding how the brain organizes itself, which can help us make more sense of data from brain measurements like fMRI, EEG, etc.
Learning unsupervised, generative models of data looks like an important step in the development of AI. Hierarchical approaches, I believe, are one of the more promising ways of thinking about it.
