Large language models also work for protein structures – Ars Technica


The success of ChatGPT and its competitors is based on what’s termed emergent behaviors. These systems, called large language models (LLMs), weren’t trained to output natural-sounding language (or effective malware); they were simply tasked with tracking the statistics of word usage. But, given a large enough training set of language samples and a sufficiently complex neural network, their training resulted in an internal representation that “understood” English usage and a large compendium of facts. Their complex behavior emerged from a far simpler training.

A team at Meta has now reasoned that this sort of emergent understanding shouldn’t be limited to languages. So it has trained an LLM on the statistics of the appearance of amino acids within proteins and used the system’s internal representation of what it learned to extract information about the structure of those proteins. The result is not quite as good as the best competing AI systems for predicting protein structures, but it’s considerably faster and still getting better.

LLMs: Not just for language

The first thing you need to know to understand this work is that, while the term “language” in the name “LLM” refers to their original development for language processing tasks, they can potentially be used for a variety of purposes. In fact, the term “Large” is far more informative, in that all LLMs have a large number of nodes (the “neurons” in a neural network) and an even larger number of values that describe the weights of the connections among those nodes.

The task in this new work was to take the linear string of amino acids that form a protein and use that to predict how those amino acids are arranged in three-dimensional space once the protein is mature. This 3D structure is essential for the function of proteins and can help us understand how proteins misbehave after they pick up mutations or allow us to design drugs to inactivate the proteins of pathogens, among other uses. Predicting protein structures was a challenge that frustrated generations of scientists until this decade, when Google’s AI group DeepMind announced a system that, for most practical definitions of “solved,” solved the problem. Google’s system was quickly followed by one developed along similar lines by the academic community.

Both of these efforts relied on the fact that evolution had already crafted large sets of related proteins that adopted similar 3D configurations. By lining up these related proteins, the AI systems could make inferences about where and what sort of changes could be tolerated while maintaining a similar structure, as well as how changes in one part of the protein could be compensated for by changes in another. These evolutionary constraints let the systems work out what parts of the protein must be close to each other in 3D space, and thus what the structure was likely to be.
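To make the idea of compensating changes concrete, here is a minimal sketch of one classic way to spot them: measuring the mutual information between two columns of an alignment. This is a toy illustration, not how either of those systems actually processes alignments; the function name and the four-sequence alignment are made up for the example.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    A high value means the amino acid at position i tends to predict the one
    at position j, which hints that the two positions change together.
    """
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    counts_i = Counter(seq[i] for seq in alignment)
    counts_j = Counter(seq[j] for seq in alignment)

    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical aligned sequences: positions 1 and 3 swap K and E together,
# while positions 0 and 2 never change.
alignment = [
    "AKLE",
    "AKLE",
    "AELK",
    "AELK",
]

covarying = column_mutual_information(alignment, 1, 3)
conserved = column_mutual_information(alignment, 0, 2)
print(f"columns 1 and 3: {covarying:.2f} bits")
print(f"columns 0 and 2: {conserved:.2f} bits")
```

In this toy case, the two positions that always change together score a full bit, while the two that never change score zero. Real alignments are far noisier, which is part of why the alignment step is computationally expensive.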

The reasoning behind Meta’s new work is that training an LLM-style neural network could be done in a way that would allow the system to sort out the same type of evolutionary constraints without needing to go about the messy business of aligning all the protein sequences in the first place. Just as the rules of grammar would emerge from training an LLM on language samples, the constraints imposed by evolution would emerge from training the system on protein samples.

Paying attention to amino acids

In practice, the researchers took a large sample of proteins and randomly blocked out the identity of a few individual amino acids. The system was then asked to predict the amino acid that should be present. In the process of this training, the system developed the ability to use information like statistics on the frequency of amino acids and the context of the surrounding protein to make its guesses. Implicit in that context are the things that required dedicated processing in the earlier efforts: the identity of proteins that are related by evolution, and what variation within those relatives tells us about what parts of the protein are near each other in 3D space.
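The masking step described above can be sketched in a few lines. This is only an illustration of how such a training example might be built: the 15 percent masking rate, the "?" mask symbol, the function name, and the sample sequence are all assumptions made for the example, not details from the article.

```python
import random

MASK = "?"  # placeholder symbol for a hidden amino acid (an assumption)


def mask_sequence(sequence, mask_fraction=0.15, rng=None):
    """Hide a random subset of amino acids and remember what was hidden.

    Returns the masked sequence plus a dict mapping each masked position to
    the original amino acid, which is what the model would be asked to guess.
    """
    rng = rng or random.Random()
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))

    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = sequence[pos]
        masked[pos] = MASK
    return "".join(masked), targets


# A made-up 20-residue sequence in one-letter amino acid codes.
sequence = "MKTAYIAKQRQISFVKSHFS"
masked, targets = mask_sequence(sequence, rng=random.Random(0))
print(masked)   # the input the model sees
print(targets)  # the answers it is scored against
```

During training, the model only ever sees the masked string; its guesses are compared against the stored targets, and the network's weights are adjusted accordingly.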

Assuming that this reasoning about how LLMs would work is correct (and Meta was building on earlier research that suggested it was), the trick to developing a working system is getting the information contained in the neural network back out. Neural networks are often considered a “black box,” in that we don’t necessarily know how they come to their decisions. But that’s becoming less true over time, as people build in features like the ability to audit the decision-making process.

In this case, the researchers relied on the LLM’s ability to describe what’s termed its “attention pattern.” In practical terms, when you give the LLM a string of amino acids and ask it to evaluate them, the attention pattern is the set of features that it looks at in order to perform its analysis.

To convert the attention pattern to a 3D structure, the researchers trained a second AI system on proteins with known 3D structures, teaching it to correlate each protein’s attention pattern with its actual structure. Since we only have experimentally determined structures for a limited number of proteins, the researchers also used some of the structures predicted by one of the other AI systems as part of this training.
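To give a feel for what reading structure out of an attention pattern means, here is a deliberately crude sketch. It does not train a second network as the researchers did; it swaps in a fixed rule that averages the attention two positions pay to each other and flags the pair as a likely contact if that average crosses a threshold. The attention values, the threshold, and the function name are all invented for the example.

```python
def attention_to_contacts(attention, threshold=0.3, min_separation=2):
    """Flag residue pairs whose mutual attention is high.

    attention[i][j] is how much position i attends to position j. Attention
    is not symmetric, so the two directions are averaged. Pairs closer than
    min_separation along the chain are skipped, since neighbors in the
    sequence are trivially close in space.
    """
    n = len(attention)
    contacts = []
    for i in range(n):
        for j in range(i + min_separation, n):
            score = 0.5 * (attention[i][j] + attention[j][i])
            if score >= threshold:
                contacts.append((i, j))
    return contacts


# Made-up attention pattern for a five-residue chain; each row sums to 1.
# The two ends of the chain (positions 0 and 4) attend strongly to each other.
attention = [
    [0.4, 0.1, 0.0, 0.0, 0.5],
    [0.1, 0.6, 0.2, 0.1, 0.0],
    [0.0, 0.2, 0.6, 0.2, 0.0],
    [0.0, 0.1, 0.2, 0.6, 0.1],
    [0.4, 0.0, 0.0, 0.1, 0.5],
]

print(attention_to_contacts(attention))
```

A learned system replaces the hand-picked threshold with a mapping fitted to known structures, and it outputs full 3D coordinates rather than a list of pairs, but the underlying intuition is similar: positions that the model consults together are often positions that sit together.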

The resulting system was termed ESM-2. Once it was fully trained, ESM-2 was able to ingest a raw string of amino acids and output a 3D protein structure, along with a score that represents its confidence in the accuracy of that structure.

Good, but not the best (yet)

To test their system, the researchers tried a range of LLM sizes, varying the number of parameters that describe the strength of the connections among their nodes from 8 million up to 15 billion. A clear pattern emerged with the varying sizes. For proteins with a lot of close relatives in the training set, you didn’t need a very large LLM in order for the prediction quality to max out. For proteins that were rare or unusual in some way, performance started out low in the base LLM and improved as the size went through XL to XXL.

There was no indication that this improvement had saturated even with the 15 billion parameter, XXL system. Thus, we’re still at the point where throwing more computational resources at this will improve overall performance, even though it’s already as good as it’s likely to get for many individual proteins.

The researchers also tried two sets of test cases on ESM-2 and Google’s AlphaFold2. For one of the sets of proteins, ESM-2 was about as accurate as AlphaFold2; for the second, AlphaFold2 had better performance. In the cases where Google’s system did perform better, one of ESM-2’s internal performance-tracking measures indicated that it was having more difficulty with the protein sequence, so this wasn’t a surprise.

The tradeoff for this drop in accuracy is speed. ESM-2 skips the whole process of trying to do evolutionary alignments; all of that is built into the system during training. For a reasonably sized protein, this works out to ESM-2 being about six times faster in coming up with a structure than AlphaFold2. This allowed the team at Meta to turn it loose on a database of over 600 million proteins that had been identified in samples of environmental DNA, something the two previous systems could process, but only with significantly more time and computational expense. (For ESM-2, it took two weeks on 2,000 GPUs.)
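Those figures allow a rough back-of-the-envelope estimate of the cost per protein. The calculation below assumes that "two weeks" means exactly 14 days, that all 2,000 GPUs were busy the whole time, and that the database held exactly 600 million proteins; since the real count was higher, the per-protein figure is an upper bound.

```python
# Figures from the article, rounded as described in the text above.
proteins = 600_000_000
gpus = 2_000
days = 14

# Total compute spent, in GPU-hours.
gpu_hours = gpus * days * 24

# Average compute per protein, in GPU-seconds.
gpu_seconds_per_protein = gpu_hours * 3600 / proteins

# How long the same job would take on a single GPU, in years.
single_gpu_years = gpu_hours / 24 / 365

print(f"{gpu_hours:,} GPU-hours in total")
print(f"{gpu_seconds_per_protein:.1f} GPU-seconds per protein")
print(f"{single_gpu_years:.0f} years on a single GPU")
```

Under these assumptions, the run consumed 672,000 GPU-hours, or about four GPU-seconds per structure on average.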

The researchers estimate that the results, which they’ve placed online, contain high-quality structures of about 28 million proteins that have no close relatives with structures available.

Where does this leave us?

To some extent, appreciating the work here will probably have to wait until people do individual comparisons on proteins where one of these AI systems performed much better than the others. That will start to give us a picture of the strengths and weaknesses of each approach. In any case, there’s some value even in the cases where the systems produce similar results; getting the same output from systems built on very different principles provides a bit more confidence in that output.

The difference in computational resources between the systems isn’t likely to be very important for this particular problem (even though it is for AI as a whole). That’s because these predictions can be made at a much faster pace than we’re identifying new proteins, so even a system that has a substantial performance penalty can wrap up existing databases in a matter of months and then can easily keep pace with discovery.

The most intriguing part of this work is that ESM-2 was still getting better as its resources increased, and it’s not clear when it would max out. It’s possible that we’d still be seeing slight improvements even as the energy and resource use made growing the system further impractical.

Science, 2023. DOI: 10.1126/science.ade2574

2023-03-16 19:01:06