Salvatore Raieli



BRAIN | AI | BRAIN RECORDINGS |

Beyond Words: Unraveling Speech from Brain Waves with AI

AI is capable of decoding speech from non-invasive brain recordings

Photo by Robina Weermeijer on Unsplash

Meta AI has introduced a new machine learning model capable of decoding speech from brain recordings. This model could eventually help thousands of people with brain injuries communicate again.

Reading a mind like a book

Photo by Sincerely Media on Unsplash

Every year, many people lose the ability to communicate. This can happen because of accidents (brain injuries), strokes, or degenerative diseases. More broadly, the United Nations estimates that more than 1 billion people live with some form of disability. In recent years, brain-computer interfaces (BCIs) have been used successfully to help people reduce the effects of these disabilities. These tools have also enabled people with speech paralysis to communicate (up to 15 words per minute).

BCI interface. image source: here

BCIs require an array of electrodes to be inserted into the cortex or otherwise placed in direct contact with it. Because this approach requires surgery, it is invasive and carries risks. Scar tissue can form, and the body can react to the presence of the foreign object (with the risk of rejection and of the implant having to be removed).

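Given the risks, several researchers have suggested decoding language from non-invasive recordings of brain activity. In general, two techniques have been proposed:

  • Magnetoencephalography (MEG), which measures the magnetic fields generated by neuronal activity.
  • Electroencephalography (EEG), which records electrical activity through electrodes placed on the scalp.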

Although these instruments have become increasingly sophisticated, they still produce a noisy signal that varies widely across individuals and devices. Given this complexity, researchers have preferred to derive hand-crafted features rather than use the signals in their original form.

Many studies therefore extracted features first and only then trained a model. In addition, some of these models were built for one patient at a time. This approach clearly has scalability limitations.

Classical pipeline for the analysis of EEG. Notice that a model is trained only after feature extraction. Image source: here, license Creative Commons: here

Can we decode the brain?

Photo by Robina Weermeijer on Unsplash

Recently, a paper set out to build a model that can decode speech from recorded brain signals.

The first problem in building a model that can decode speech from brain signals is that we do not know how spoken words are represented in the brain. So, before applying a model to patients, the authors started with healthy subjects listening to recordings in their own language.

In the approaches used so far, decoding has been treated as a special case of regression. Instead, the authors suggest using a contrastive loss (the CLIP loss). This loss function was originally defined to align the latent representations of text and images. The authors apply the same idea to align the latent representation of brain activity with that of the corresponding sound.
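To make the contrastive idea concrete, here is a minimal sketch of a CLIP-style loss in plain Python. It is not the authors' implementation: the symmetric form (brain to speech and speech to brain) follows the original CLIP formulation, and the L2 normalization, the `temperature` value, and the function names are assumptions made for illustration.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]

def clip_loss(brain_latents, speech_latents, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired latents.

    brain_latents[i] and speech_latents[i] come from the same segment.
    The temperature is an assumed value, not one taken from the paper.
    """
    brain = [l2_normalize(v) for v in brain_latents]
    speech = [l2_normalize(v) for v in speech_latents]
    n = len(brain)
    # logits[i][j]: similarity between brain segment i and speech segment j
    logits = [
        [sum(x * y for x, y in zip(b, s)) / temperature for s in speech]
        for b in brain
    ]
    # Each brain segment should pick its own speech segment...
    brain_to_speech = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # ...and each speech segment should pick its own brain segment.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    speech_to_brain = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (brain_to_speech + speech_to_brain)

# Toy latents: matched pairs give a low loss, shuffled pairs a high one.
matched = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(clip_loss(matched, matched))
print(clip_loss(matched, shuffled))
```

The difference from regression is that the loss only asks the correct pair to score higher than the other pairs in the batch, rather than asking the brain latent to reproduce the speech latent exactly.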

The authors define a new brain module for the recordings (MEG or EEG). The input is the brain signal and a vector identifying the corresponding participant (a one-hot encoding of the study participant). This module consists of a spatial attention layer over the sensors and a participant-specific 1 × 1 convolution (a sort of embedding). This is then followed by three convolution layers. The output of the module is the latent representation of the brain signal.
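The participant-specific 1 × 1 convolution can be read as one channel-mixing matrix per participant, applied independently at every time step. The sketch below illustrates that reading in plain Python; the function name `subject_layer`, the data layout, and the toy weights are all hypothetical and not taken from the released code.

```python
def subject_layer(recording, subject_index, subject_weights):
    """Apply a participant-specific 1 x 1 convolution.

    recording:       list of channels, each a list of time samples (C x T)
    subject_index:   integer identifying the participant
    subject_weights: one (D x C) matrix per participant (hypothetical values)
    """
    # Indexing by participant is equivalent to multiplying the stack of
    # matrices by the participant's one-hot vector.
    weights = subject_weights[subject_index]
    n_times = len(recording[0])
    # A 1 x 1 convolution mixes channels at each time step separately.
    return [
        [sum(w * channel[t] for w, channel in zip(row, recording))
         for t in range(n_times)]
        for row in weights
    ]

# Toy example: 2 channels, 3 time samples, 2 participants.
recording = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
subject_weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # participant 0: identity mixing
    [[0.0, 1.0], [1.0, 0.0]],  # participant 1: swaps the two channels
]
print(subject_layer(recording, 0, subject_weights))
print(subject_layer(recording, 1, subject_weights))
```

Because only this layer depends on the participant, the rest of the network can be shared across everyone in the study.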

image source: here

At the same time, the authors use Wav2vec to analyze sound and extract a representation of speech. Thus, the idea is to maximize the alignment between the latent representation of sound and brain activity.

image source: here

Once the model was defined, the authors curated a collection of MEG and EEG datasets recorded while participants listened to short stories (175 participants in total). They then evaluated how well the model identifies the corresponding audio segment for 1,500 brain recording segments (across 4 different datasets).

image source: here

The results obtained with MEG are significantly better than those obtained with EEG (both are above a random baseline). With MEG, the model performs well at picking the exact segment and even better when the correct segment only has to appear among its most likely candidates.
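The two ways of scoring described above (exact segment versus most likely segments) correspond to top-1 and top-k retrieval accuracy. Here is a small sketch of how such a metric can be computed; the similarity scores are made-up toy numbers, and the function name is an assumption, not something from the paper's code.

```python
def top_k_accuracy(scores, k):
    """Fraction of brain segments whose true audio segment is in the top k.

    scores[i][j] is the similarity between brain segment i and candidate
    audio segment j; the correct candidate for row i is assumed to be j == i.
    """
    hits = 0
    for i, row in enumerate(scores):
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(scores)

# Toy similarity matrix (3 brain segments x 3 candidate audio segments).
scores = [
    [0.9, 0.1, 0.0],  # correct candidate ranked first
    [0.6, 0.5, 0.1],  # correct candidate ranked second
    [0.2, 0.7, 0.3],  # correct candidate ranked second
]
print(top_k_accuracy(scores, 1))
print(top_k_accuracy(scores, 2))
```

Accuracy can only go up as k grows, which is why the relaxed criterion always looks better than exact-match accuracy.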

image source: here

The authors also conduct an ablation study to understand which elements of the model matter. As they show, a model trained with a regression objective (Base model) is better than a random model, but the contrastive loss (‘+ Contrastive’) improves the results. In addition, working with latent representations of both the brain signal and the sound yields much better performance.

image source: here

The authors also note:

  • The various components of the brain module are necessary (convolution, spatial attention module, and so on).
  • Generally, MEG is superior to EEG for decoding; this is an inherent limitation of the instrument (and varies depending on the EEG device used for recording).
  • Also, the greater the number of participants in the study, the better the performance of the model (the model is thus able to take inter-individual variability into account).

What representation does the model learn?

It is difficult to understand what the model is decoding from a brain signal. This is an important issue, though, because it is related to the interpretability of the model.

In the figure below, we look at the probability the model assigns, for each participant, to the phrase ‘Thank you for coming, Ed’. We can see for which participants the model had the best and worst results. The question is: when the model is wrong, where does the error come from? Is the error related to the phonology or to the semantics of the sentence? Answering this question allows us to improve the model.

image source: here

To try to answer this question, the authors analyzed the predictions for a single word and for the segment that contains it. They then trained a linear regressor to estimate the probability that the model assigns to the correct word. The goal is to understand, through this linear model, which factors influence the prediction of the correct word (low-level representations such as phonemes or high-level representations such as sentences). The results show that part-of-speech, word embedding and phrase embedding are correlated with the predictions. In other words, the model is more influenced by higher-level representations: it relies on semantic and syntactic representations more than on the low-level form of the word itself.
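As a rough illustration of this kind of probing, the sketch below fits a one-feature linear regression and reports how much of the variation in the decoder's probabilities that feature explains (R²). Everything here is a simplifying assumption: the paper's analysis may use several features at once, and the feature values and probabilities are synthetic toy numbers, not results.

```python
def r_squared(feature, target):
    """Fit target = slope * feature + intercept by least squares; return R^2."""
    n = len(feature)
    mean_x = sum(feature) / n
    mean_y = sum(target) / n
    sxx = sum((x - mean_x) ** 2 for x in feature)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(feature, target))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual = sum((y - (slope * x + intercept)) ** 2
                   for x, y in zip(feature, target))
    total = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - residual / total

# Synthetic toy data: the probability the decoder gives to the correct word.
probabilities = [0.1, 0.3, 0.5, 0.7]
# Two hypothetical word-level features, one informative and one not.
informative_feature = [1.0, 2.0, 3.0, 4.0]
uninformative_feature = [1.0, -1.0, -1.0, 1.0]

print(r_squared(informative_feature, probabilities))
print(r_squared(uninformative_feature, probabilities))
```

A feature with a high R² is one the decoder's behaviour tracks closely, which is the sense in which the authors say some representations influence the predictions more than others.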

image source: here

The authors have also released the code: https://github.com/facebookresearch/brainmagick

Parting thoughts

Photo by Saif71.com on Unsplash

The model succeeds in accurately identifying, from segments of brain activity, the corresponding speech segments. This is already a remarkable achievement considering how noisy the data is.

Historically, analyzing this data required complex pipelines, often dedicated to each participant. The advent of deep learning has made these pipelines much more nimble. The authors here propose an end-to-end architecture that requires minimal preprocessing, making analysis much simpler.

To date, we have yet to learn how the brain represents language. This obviously impacts the result. The authors therefore exploited a model trained on large amounts of speech and aligned it with a model that learns from brain signals. This is a very clever approach.

In any case, it is still premature to think that this model can enter the clinic. The recording devices used are far from portable, and the model still needs to be refined. In particular, the model must be able to handle more complex sentences and reach greater accuracy.

What do you think? Let me know in the comments

If you have found this interesting:

You can look for my other articles, and you can also connect or reach me on LinkedIn.

Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.

https://github.com/SalvatoreRa/tutorial
