One Lab Wrote the Code to DeepMind's Protein AI without Coding

Although the Google company solved a fundamental problem in biology, it did not immediately share its solution, so a University of Washington team set out to recreate it.

For biologists who study the structure of proteins, the recent history of their field is divided into two epochs: before CASP14, the 14th biennial round of the Critical Assessment of Structure Prediction conference, and after. For decades, scientists had worked to figure out how to determine a protein’s structure from its amino acid sequence. After CASP14, which took place in December 2020, the problem had effectively been solved by researchers at the Google subsidiary DeepMind.

DeepMind is a research firm that focuses on “deep learning,” a type of artificial intelligence. It was previously best known for creating an AI system that beat the world champion at the game of Go. Its success in protein structure prediction, achieved with a neural network called AlphaFold2, marked the first time it had solved an actual scientific problem. The ability to predict protein structures can aid scientists in basic research and drug discovery. DeepMind released its code to the public on July 15, when Nature published an unedited manuscript detailing the model.

In the months after CASP, however, another team stepped into that role. In June, a month before DeepMind’s manuscript was published, a team led by David Baker, director of the Institute for Protein Design at the University of Washington, released its own model for protein structure prediction. For that month, the model, called RoseTTAFold, was the best-performing protein prediction algorithm available to scientists. Though it did not reach the same peaks of performance as AlphaFold2, the team made sure it would be accessible to even the least computationally inclined scientist by building a tool that lets researchers submit their amino acid sequences and get back predictions without getting their hands dirty with computer code. The Baker Lab paper describing RoseTTAFold was published by Science a month later.

RoseTTAFold and AlphaFold2 are both complex, multilayered neural networks that predict a protein’s 3D structure from its amino acid sequence. They also share interesting similarities, such as a multitrack design that lets them examine different aspects of the protein structure separately.

This is no accident. The University of Washington team built RoseTTAFold using ideas from DeepMind’s short presentation at CASP, in which the company described the distinctive elements of AlphaFold2. The team was also spurred on by the lack of information from DeepMind about when scientists would have access to the technology. Researchers were concerned that a private firm might keep the code secret, in contravention of academic practice. “Everyone was stunned, there was a lot of press, and then it became radio silence, basically,” Baker says. “You’re stuck in a strange situation, where there has been a significant advance in your field but you can’t build upon it.”

Baker and Minkyung Baek, a postdoctoral researcher in his lab, had no idea when, or if, the DeepMind team would make its tool accessible to structural biologists. They decided to create their own.

Figuring out the three-dimensional structure of proteins is essential to understanding the inner workings of cells, says Dame Janet Thornton, director emeritus of the European Bioinformatics Institute. She says that although DNA carries the code for everything, it does not do everything itself; proteins do most of the work. Scientists use a range of experimental techniques to determine protein structures, but those techniques sometimes do not yield enough information for a definitive answer.

A computer model that predicts a protein’s structure from its unique amino acid sequence can help researchers make sense of confusing data. For the past 27 years, CASP has given scientists a way to assess how well their algorithms perform. Thornton says that progress had been steady but slow. AlphaFold2, she continues, was a significant improvement, more dramatic than any other program had delivered in many years. In that respect, she says, it was a significant step forward.

The Baker Lab had achieved the second-best performance at CASP14 with a model of its own, which gave the team a solid place to start when it came to reproducing DeepMind’s method. After comparing what DeepMind’s researchers had said about AlphaFold2 with their own approach, the team began to build a model that incorporated each of the important advances.

One crucial innovation they adopted was a multitrack network. Most neural networks process and analyze data along a single “track,” or path, through the network. Each layer of simulated “neurons” transforms the outputs of the preceding layer. This is a little like a game of telephone, in which players turn the sounds they hear into words that they whisper into the next player’s ear. In a neural network, however, the information is gradually transformed into a more useful form rather than degraded, as it is in the game.

DeepMind designed AlphaFold2 to split different aspects of protein structure information into two separate tracks that feed some information back to each other. It was like two games of telephone happening simultaneously, with nearby players passing some of their information back and forth. RoseTTAFold, Baker and Baek found, worked best with three tracks.
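
To make the telephone analogy concrete, here is a minimal toy sketch in Python. It is not code from AlphaFold2 or RoseTTAFold; the function names, the random untrained weights, and the four-number "features" are all invented for illustration. It contrasts a single-track network, where each layer sees only the previous layer's output, with a two-track network in which each track also reads the other track's state at every step.

```python
import math
import random


def dense(vec, weights):
    """One layer of simulated neurons: weighted sums passed through tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in weights]


def random_weights(rng, n_out, n_in):
    """Hypothetical untrained weights; a real network would learn these."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def single_track(x, layers):
    """One game of telephone: each layer sees only the previous layer's output."""
    for weights in layers:
        x = dense(x, weights)
    return x


def two_track(a, b, blocks):
    """Two games at once: every block lets each track read the other's state."""
    for w_a, w_b in blocks:
        # List concatenation hands each track its own state plus its neighbor's.
        # The right-hand side is evaluated first, so both updates use the old a and b.
        a, b = dense(a + b, w_a), dense(b + a, w_b)
    return a, b


rng = random.Random(0)
dim, depth = 4, 3
layers = [random_weights(rng, dim, dim) for _ in range(depth)]
blocks = [
    (random_weights(rng, dim, 2 * dim), random_weights(rng, dim, 2 * dim))
    for _ in range(depth)
]

track_a = [0.1, 0.2, 0.3, 0.4]  # made-up features for one view of the protein
track_b = [0.4, 0.3, 0.2, 0.1]  # made-up features for a second view

single_out = single_track(track_a, layers)
out_a, out_b = two_track(track_a, track_b, blocks)
print(single_out)
print(out_a, out_b)
```

In the two-track version, changing the second track's input changes the first track's output, which is the "passing information back and forth" that the analogy describes. A third track could be added in the same way by concatenating three states instead of two.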

Baek points out that when you are drawing complicated figures, it is not possible to draw them all in one go. You will start with very basic sketches and add pieces as you go. This process is similar to protein structure prediction.

Baker and Baek wanted to see RoseTTAFold in action, so they reached out to structural biologists who had protein structure problems they had been unable to solve. David Agard of the University of California, San Francisco sent the amino acid sequence of a protein produced by infected bacteria. By 1 AM, the structure predictions were back. In six hours, RoseTTAFold had solved a problem that had troubled Agard for more than two years. Agard says, “We were able to see the evolution of this protein from two bacterial enzymes that probably evolved millions of years ago.” Agard’s lab can now move on to figuring out how the protein works.

Although RoseTTAFold had not reached the same level of performance as AlphaFold2, Baker and Baek realized it was time to release their tool into the wider world. Baker says the tool was still very valuable because it had helped solve biological problems, some of them long-standing. They decided it was important for scientists to have both the information and access to the model.

John Jumper, who leads the AlphaFold program, says DeepMind submitted its manuscript to Nature on May 11.

The scientific community didn’t know much about DeepMind’s timeline at that time. Three days after Baker’s preprint was made available, on June 18, DeepMind CEO Demis Hassabis posted on Twitter. He wrote, “We have been working hard on our full method paper with open source code accompanying it and providing wide free access to AlphaFold to the scientific community. More to come!”

Nature published DeepMind’s unedited but peer-reviewed AlphaFold2 manuscript on July 15. Simultaneously, DeepMind made the code for AlphaFold2 freely available on GitHub. A week later, DeepMind released a massive database of 350,000 protein structures it had predicted using its algorithm. Scientists now have access to both the revolutionary prediction tool and an enormous number of predictions.

Jumper says there is a simple reason DeepMind’s code and paper were not released until seven months after the CASP presentation. “We weren’t ready to open-source or publish this very detailed paper that day,” he says. Once the paper had been submitted in May and was going through peer review, Jumper says, the team released it as quickly as possible. He says, “We were honestly trying to push as hard as possible.”

DeepMind’s manuscript was published using Nature’s Accelerated Article Preview workflow, which is most often used for Covid-19 papers. A spokesperson for Nature stated that the process is intended as a “service to our authors, and to our readers, in order to make particularly time-sensitive, peer-reviewed research available as quickly as possible.”

Jumper and Pushmeet Kohli, the science lead for DeepMind, demur on whether Baker’s paper factored into the timing of their Nature publication. Kohli says, “From our point of view, we submitted and contributed the paper in May. So it was out of our control, in some way.”

CASP organizer John Moult thinks the University of Washington team’s work may have helped DeepMind scientists convince their parent company to release their research on a faster timeline. Moult says he trusts them, because they are outstanding scientists, but there is a tension: DeepMind is a commercial enterprise, and it has to make money. Its parent company, Alphabet, holds the number four spot in market capitalization.

Hassabis describes the AlphaFold2 release as a benefit to both the scientific community and Alphabet. In an interview with WIRED, he described it as open science. “This system, code and database are free for all to use,” he said. When asked whether there had been any discussions about keeping the code secret for commercial purposes, he replied, “It is a great question, how do we deliver value. There are many ways to deliver value. There’s obviously one, which is commercial. But there is also prestige, and there are other ways to deliver value.”

Baker praises the DeepMind team’s thoroughness in releasing both the code and the paper. RoseTTAFold, Baker says, was in a way a safeguard in case DeepMind did not act in the spirit of scientific collaboration. He says, “If they were less educated and had decided to not [release the code], then at least there would have been something for the world to build upon.”

He feels, however, that had the information been available earlier, his team would have been able to push AlphaFold2 even further, or to adapt it for the task of designing artificial proteins, which is the Baker Lab’s primary focus. Baker says, “There is no doubt that if they had said, ‘Here’s the code, and this is how we did it,’ we would have been a lot further ahead.”

Some real-world uses of protein structure prediction could require a lot of time. Scientists could, for example, determine the three-dimensional structure of a protein essential to a pathogen’s survival, which would help them develop drug therapies to combat it. The applications could even extend to the pandemic; for example, DeepMind used a version of AlphaFold2 to predict the structures of some SARS-CoV-2 proteins last August.

Baker believes that the need to share information between industry and academia will grow. Artificial intelligence problems require immense time and resources, and companies like DeepMind can draw on computing power and personnel on a scale that is unimaginable in a university laboratory. Baker says it is almost certain that major breakthroughs will continue to be made by companies, and that this trend will only increase. Within those companies there will be pressure over whether to publish the advances, as DeepMind did, and over how to monetize them.

Additional reporting from Will Knight.

Published at Thu, 12 August 2021 11:25:33 +0000