Salvatore Raieli



AI enables designing new proteins from scratch

How artificial intelligence can allow producing unseen proteins

image made by the author using stable diffusion and an image of a protein

Since the publication of AlphaFold2, scientists have speculated about what the fallout might be from a model that predicts protein structure. Now a new study shows how to use a language model to generate artificial enzymes that display real functional activity.

AlphaFold2 showed that deep learning can successfully predict the structure of a protein from its sequence (here, for example, is a tutorial where you can run it yourself). Although this was considered an outstanding achievement, the model is not without limitations.

Meanwhile, DeepMind is not the only company that has shown interest in the possibility of predicting protein structure. Both Meta (formerly Facebook) and Salesforce (an American cloud-computing company) have developed their own models, using a classical language model (a transformer) adapted for the occasion. The result is that these models are slightly less accurate than AlphaFold2 but much faster (and also less computationally expensive).

In their latest work, the Salesforce authors present ProGen, a model that was trained on 280 million proteins from over 19,000 families. In addition, this model can be fine-tuned on a specific family and conditioned to generate new sequences. In this study, the authors generated new lysozyme-like proteins that indeed showed catalytic activity, just like natural lysozyme. The key point is that these are not simply variations of natural sequences but largely redesigned ones; in fact, some of these new enzymes have sequence identity to natural proteins as low as 31%.
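To make the 31% figure more concrete: the usual way to quantify how close two sequences are is percent identity over an alignment. The snippet below is a minimal sketch, assuming the two sequences are already aligned (same length, with "-" marking gaps) and that gapped columns are left out of the denominator; other conventions for the denominator exist. The two fragments are toy strings made up for illustration, not sequences from the study.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are ignored
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy, made-up aligned fragments (hypothetical, for illustration only)
natural = "KVFGRCELAA"
artificial = "KIYE-CQLAR"
print(f"{percent_identity(natural, artificial):.1f}% identity")
```

Here 4 of the 9 compared columns match, so the two toy fragments share roughly 44% identity; a value around 31% means that more than two residues out of three differ from the closest natural protein.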

Lysozyme crystal. X-ray crystallography is the method of choice for obtaining the structure of a protein (source)

You can find the original article here:

Traditional protein engineering methods usually rely on iterative rounds of mutagenesis and selection. During these laborious steps, an attempt is made to select proteins for particular properties of interest. There are also in silico design methods that rely on biophysical properties or on evolutionary studies of the sequence. These methods, however, are often slow and not always successful.

ProGen is a language model (a transformer) that is optimized to predict the probability of the next amino acid given all the previous ones in the sequence. In other words, the model takes a protein sequence as input and predicts the next amino acid (let’s say the sequence is XXYZ: given X, the model learns to predict X; then, given XX, it learns to predict Y; and so on). This approach is a form of unsupervised (more precisely, self-supervised) learning and allows the model to pick up the patterns and properties of the data.
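The XXYZ example above can be written down in a few lines. This is only a sketch of the next-token objective, not ProGen's code: `uniform_model` is a hypothetical placeholder that assigns the same probability to each of the 20 standard amino acids, standing in for a trained transformer. Real models also predict the first residue from a start token, which is omitted here for brevity.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def next_token_pairs(sequence):
    """Split a sequence into (prefix, next_residue) training examples."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


def uniform_model(prefix):
    """Hypothetical placeholder model: ignores the prefix, uniform probabilities."""
    p = 1.0 / len(AMINO_ACIDS)
    return {aa: p for aa in AMINO_ACIDS}


def average_nll(sequence, model):
    """Average negative log-likelihood of each residue given its prefix."""
    pairs = next_token_pairs(sequence)
    total = -sum(math.log(model(prefix)[target]) for prefix, target in pairs)
    return total / len(pairs)


print(next_token_pairs("XXYZ"))             # the pairs from the example in the text
print(average_nll("MKVLA", uniform_model))  # loss of an untrained, uniform guesser
```

Training consists of adjusting the model so that this average loss goes down over millions of sequences; a model that has learned nothing sits at ln(20), roughly 3.0, and anything lower means it has picked up some regularity in the sequences.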

As can be seen, during training no structural information or other assumptions about the evolution of a protein family are provided. At the same time, however, the model is capable, through unsupervised learning, of capturing some of the structural and functional properties of a protein that are hidden in the sequence.

Once trained, the model can be used to generate sequences:

After training, Progen can be prompted to generate full-length protein sequences for any protein family from scratch, with a varying degree of similarity to natural proteins. In the common case where some sequence data from a protein family is available, we can use the technique of fine tuning pretrained language models with family-specific sequences to further improve the ability of Progen
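Generation with this kind of model is just repeated sampling: ask for the probabilities of the next residue, draw one, append it, and repeat until an end token appears. The sketch below illustrates the loop under strong assumptions: `toy_next_probs` is a made-up stand-in for a trained model (it only looks at the last residue and the current length), and the `<end>` token name is hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
END = "<end>"  # hypothetical end-of-sequence token


def toy_next_probs(prefix):
    """Made-up stand-in for a trained model: next-token probabilities for a prefix."""
    # The chance of stopping grows with length; the last residue is slightly favoured.
    p_end = min(0.9, 0.02 * len(prefix))
    weights = {aa: 1.0 for aa in AMINO_ACIDS}
    if prefix:
        weights[prefix[-1]] = 3.0
    total = sum(weights.values())
    probs = {aa: (1.0 - p_end) * w / total for aa, w in weights.items()}
    probs[END] = p_end
    return probs


def generate(model, rng, max_len=60):
    """Sample one sequence residue by residue until END or max_len is reached."""
    seq = ""
    while len(seq) < max_len:
        probs = model(seq)
        tokens = list(probs)
        token = rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
        if token == END:
            break
        seq += token
    return seq


rng = random.Random(0)  # fixed seed so the run is reproducible
print(generate(toy_next_probs, rng))
```

The loop itself is cheap, which is why a trained model can produce a very large number of candidate sequences; all the biological knowledge lives in the function that returns the probabilities.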

ProGen has 1.2 billion parameters, and it can be conditioned to generate particular types of sequences through control ‘tags’. These tags can represent concepts such as protein family, biological process, or molecular function.
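One simple way to picture this conditioning is to think of the tags as extra tokens placed in front of the amino acids, so that every prediction is made "given these tags". The snippet below is only an illustration of that idea: the tag names and the `<...>` token format are hypothetical, not ProGen's actual vocabulary.

```python
def build_conditioned_input(tags, sequence=""):
    """Prepend control-tag tokens to the residue tokens of a (possibly empty) sequence."""
    tag_tokens = [f"<{tag}>" for tag in tags]  # hypothetical token format
    return tag_tokens + list(sequence)


# Training example: tags followed by a known sequence fragment (toy values)
print(build_conditioned_input(["family:lysozyme"], "KVFGR"))

# Generation prompt: tags only, the model is asked to continue from here
print(build_conditioned_input(["family:lysozyme", "function:hydrolase"]))
```

During training the model sees tags and sequence together; at generation time only the tags are given, and the sampled continuation is a sequence of the requested type.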

3d structure of lysozyme (source)

The authors decided to test it on five lysozyme families. The ProGen model was first trained on the large general protein dataset; the authors then selected 55,000 lysozyme sequences for fine-tuning. After that, they generated one million sequences with the model:

Our artificial lysozymes span the sequence landscape of natural lysozymes across five families that contain diverse protein folds, active site architectures and enzymatic mechanisms. As our model can generate full-length artificial sequences within milliseconds, a large database can be created to expand the plausible sequence diversity beyond natural libraries

As the authors note, the generated sequences capture evolutionary conservation patterns even though this information was never given to the model explicitly.
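A common way to look at conservation patterns is to measure, for each column of a multiple sequence alignment, how variable the residues are: a column where every sequence has the same amino acid is fully conserved. The sketch below uses Shannon entropy per column on a tiny made-up alignment; it is meant to illustrate the concept, not to reproduce the authors' analysis, and it treats a gap like any other symbol.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def conservation_profile(alignment):
    """Entropy for every column; 0.0 means the column is perfectly conserved."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy([row[i] for row in alignment]) for i in range(length)]


# Toy, made-up alignment of four short fragments (hypothetical)
toy_alignment = ["KVFGR", "KIFGK", "KVYGR", "KLFGQ"]
for position, entropy in enumerate(conservation_profile(toy_alignment), start=1):
    print(position, round(entropy, 3))
```

Computing such a profile for natural sequences and for generated ones, and checking that the same positions are conserved in both, is one way to see whether a model has picked up these patterns on its own.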

They subsequently selected one hundred sequences (using divergence from natural sequences as a criterion) to express as proteins and test functionally. As the authors noted, “Artificial proteins included specific amino acids and pairwise interactions never before observed in lysozyme family-specific alignments.” This is an interesting result, showing that AI can generate sequences different from those observed in nature. The question remains: are these proteins functional?

Gram-positive bacteria: lysozyme degrades peptidoglycan, the major component of the Gram-positive bacterial cell wall, which compromises the integrity of the wall. Image source (here)

The authors synthesized these candidate proteins in the laboratory and were able to obtain a quality sample in 72 percent of cases, quite a high success rate! They also tested these proteins using quench release of fluorescein-labeled Micrococcus lysodeikticus cell wall, an assay that measures lysozyme activity. The result is rather impressive:

Among our artificial proteins, 73% (66/90) were functional and exhibited high levels of functionality across families. The representative natural proteins exhibited similar levels of functionality with 59% (53/90) of total proteins considered functional.

Translated into simple words, the proteins that were generated by AI show comparable activity to natural lysozymes. In addition, the authors note that there are particular outliers:

These highly active outliers demonstrate the potential for our model to generate sequences that may rival natural proteins that have been highly optimized through evolutionary pressures.

Finally, the authors tested the model with other families and conducted ablation studies. The latter showed that both initial training and fine-tuning are necessary to obtain an optimal result.

Training with the universal sequence dataset containing many protein families enables ProGen to learn a generic and transferable sequence representation that encodes intrinsic biological properties. Fine tuning on the protein family of interest steers this representation to improve generation quality in the local sequence neighborhood.

This and similar tools are paving the way to a new revolution based on the possibility of creating new proteins. In fact, the model can generate a new protein sequence within milliseconds. In the future, this approach could be adapted to many other kinds of protein families. Moreover, the authors used tags to fine-tune the model and generate proteins from a given family (in this case lysozyme), but such a tag could also stand for a specific function (a specific reaction, substrate, and so on). For example, we could in the near future see new enzymes that digest pollutants, fight parasites and bacteria, act as vaccines, or serve as drugs in clinical applications.


Edited by Luciano Abriata (LucianoSphere)
