Unleash artificial intelligence on old measurement results and you won’t believe what it adds to your knowledge. That is the promise of systems biologist Lennart Martens, a staunch opponent of blinkered thinking.

It all went wrong when he was given a Commodore 64 at the age of eight, Lennart Martens chuckles. ‘After two weeks I was programming in BASIC, and I have been a computer nerd heart and soul ever since.’ That he did not study computer science he owes to a few excellent chemistry and biology teachers – and a certain lack of enthusiasm for theoretical mathematics.

Within the VIB-UGent Center for Medical Biotechnology, Martens heads the computational omics group, abbreviated CompOmics. ‘High-throughput techniques such as genomics, transcriptomics, proteomics and metabolomics yield so much information that you need software to plough through it’, he explains. It is a very mixed group of computer scientists, physicists, bioengineers and biomedical engineers: some write algorithms, others apply them. But classical bioinformatics, which interprets fresh experimental data, is not the main focus. Instead, they reuse archived data to build models that predict which analytical procedure to use in the next experiment. 

‘When you are exploring, you don’t want to rule anything out in advance’ 

Martens can imagine that not every experimenter appreciates this equally: ‘We are going to tell people who are doing the work how they can do it better. But I see it as inspiration for those who use their own expertise to turn the proposed solutions into the real thing. We are not about to throw the human capacity for interpretation into the dustbin.’

What does such a prediction entail? 

‘Mainly liquid chromatography and mass spectrometry. If you load your analyte onto a certain LC column, roughly how long will it take to come off? And roughly what pattern of peaks will it produce in the MS afterwards? The original idea was that this lets you interpret your data better. But you can also use it to say in advance what a good LC protocol would be if you want to look at certain molecules: the column, the gradients, the solvents, the whole lot. It works with proteins and with small molecules. For the latter, we are already working with a pharmaceutical company: how do you best handle the routine control of production?

‘We have been working on this for a long time and we are certainly not alone. But thanks to the rise of machine learning, it has gained momentum. With such algorithms you can promise people: try this chromatography, this column and this gradient, and you will get an optimal separation – based on all the historical chromatography of all the molecules you have ever run. Analytical chemistry is a very conservative field, but there, too, people are starting to see that a big-data approach can facilitate and speed up the work.’
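The idea of learning retention behaviour from historical runs can be sketched in miniature. The sketch below is purely illustrative and is not the CompOmics method: the peptides and retention times are invented, and a single hydrophobicity feature with a straight-line fit stands in for the far richer features and deep models used in practice.

```python
# Toy sketch: predict LC retention time from a crude hydrophobicity score.
# The 'historical' peptides and retention times below are invented.

# Kyte-Doolittle hydropathy values for a handful of amino-acid residues
HYDROPATHY = {'A': 1.8, 'L': 3.8, 'K': -3.9, 'G': -0.4, 'F': 2.8, 'S': -0.8}

def hydrophobicity(peptide):
    """Sum of per-residue hydropathy values: a crude feature for retention."""
    return sum(HYDROPATHY[aa] for aa in peptide)

# Hypothetical historical runs: (peptide, observed retention time in minutes)
history = [("AALK", 12.0), ("FFLG", 30.1), ("SSKG", 5.2), ("LLFA", 33.0)]

# Ordinary least-squares fit of time = a * hydrophobicity + b
xs = [hydrophobicity(p) for p, _ in history]
ys = [t for _, t in history]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def predict_rt(peptide):
    """Predicted retention time for a peptide never seen before."""
    return a * hydrophobicity(peptide) + b
```

The point of the sketch is the workflow, not the model: features derived from the molecule, a model fitted on archived runs, and predictions for molecules that were never measured.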

[Image: Lennart Martens. Credit: Bart Cloet]

Machine learning has a reputation for sometimes reasoning towards predetermined conclusions. Aren’t you afraid of confirmation bias? 

‘Yes, you have to pay very close attention. In quality control it does not matter much: there you usually know quite well which canary-in-the-coal-mine molecules to expect when something goes wrong with production. But if you are exploring – looking in plasma for proteins that are potential biomarkers, for example – you don’t want to rule anything out in advance. 

That is precisely what makes machine learning algorithms so interesting: they can abstract. They look at what you have already seen, which is biased, but they can also learn the underlying principles of how those molecules behave in your instruments and apply them to molecules you have never seen before. Then you can say, for example: I will take the entire proteome, all possible proteins and fragments of them, including the ones we have never seen, and predict how they will behave. In concrete terms, we have developed ionbot for this purpose, a search engine that tries to identify all the proteins in a sample on the basis of LC and MS data.’ 

Such software has been around for some time, hasn’t it? 

‘Yes, but until now such software got confused if the protein fragments showed too much variation due to phosphorylation or other changes that occur after protein synthesis. Such post-translational modifications are everywhere and, to make matters worse, there are also artefacts: modifications without a biological function. Sometimes these are accidents in the cell, for example when a reactive intermediate does not neatly follow its reaction pathway but modifies the enzyme instead. They can also be a side effect of the protocols you use to prepare your samples, such as capping the free ends after breaking disulphide bonds. In the MS, all those extra masses count. 

For 20 years we looked at such data with blinkers on. We knew the modifications were there, but out of pure pragmatism we swept them under the carpet. Machine learning puts an end to that kind of bias: for the first time, our ionbot can take all post-translational modifications into account. And we were a little shocked, because there are many more than we thought.’ 

‘A certain modification was known from fifty proteins, whereas we had already seen it 2,500 times’ 

So the ionbot recognises the protein chain among the modifications? 

‘The MS indicates when a protein fragment carries a phosphorylation or a methylation, or an artefact such as an oxidation. You can map that across the whole protein, and then you see that some chains can carry a great many modifications. We have now downloaded a billion human MS spectra from the PRIDE proteomics database, which I started at EMBL-EBI in Cambridge and which has grown into the largest of its kind in the world. We are looking at those spectra again with the blinkers off, and for the first time this will give us a hyper-detailed insight into what we have missed so far.’ 

Do they include modifications that you can link to biological phenomena? 

‘Some do, because they have already been studied in the classic biochemical way. Take the differences in phosphorylation between normal tissue and tumour tissue: when we look at them, we find what is already known. But we also see many new things, and modifications that have never been studied. Recently, someone asked me about a specific modification that plays a role in the innate immune system. He knew of about 20 proteins in which it occurs, and fell off his chair when it turned out we had already collected 1,500 of them. Another modification was known from 50 proteins, while we had already seen it 2,500 times. When we go public with these data, the shock value could become the biggest problem: people who say, that’s not possible, we refuse to believe that there is that much on proteins.’ 

[Image: Lennart Martens. Credit: Bart Cloet]

Going public, you say. That still has to happen? 

‘Many people have already received our data; if someone asks for something, we send them what we have. But we are still struggling with making it visible. As a prototype we have built the Scop3P website, which displays only phosphorylation. It shows linear protein sequences and 3D structures indicating where the phosphosites are. The idea was to make a Scop3PTM with all post-translational modifications, but there are so many of them in a row that it becomes confusing.’ 

Can you also do something with the artefacts? 

‘For starters, we can look back at which protocols were used, to see which ones cause the least damage. But biological artefacts also contain information that can be very useful. If a substance modifies your protein unintentionally, in a purely chemical way, you would expect those modifications only on the outside. So such modifications give you an idea of the 3D shape. Thanks to AlphaFold, revealing the fold itself is no longer a priority; the dynamic behaviour is now much more important. To make the active site available, many proteins change their structure as soon as they bind a certain factor. If you find artefacts at a site that is normally buried, you can assume that there are two conformations.’ 

Your research leans on existing data, so it is obvious that you are an advocate of open science. Does that work in this field? 

‘Yes and no. We have a lot of public data; PRIDE is running like a dream. But what does not work well yet is the annotation. With machine learning we wanted to predict, purely on the basis of the proteome, which tissue a dataset originates from. Finding test sets whose origin is known proved to be a major problem. Even in the corresponding publications we often could not find that information. Sometimes people did not even specify the instrument they had worked with; they had simply chosen the top instrument from the drop-down menu. So we are missing out on much of the future promise of those data. 

‘If you want your data to have a long and useful life, you have to furnish it with metadata’

I think we are too used to relying on written conclusions. You generated data in order to publish, and nobody was ever going to reuse those data. That is now changing. Your conclusions are probably out of date after four or five years, and the majority of papers are never read again, but you can still do all sorts of things with the data sets. Whole generations of scientists have never thought about that. To ensure that your data have a long and useful life, however, you have to furnish them with metadata. When papers are submitted, we should pay much more attention to this.’ 
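What ‘furnishing data with metadata’ means in practice can be as simple as a structured, machine-readable record attached to each deposited dataset. The field names and values below are invented for illustration; they are not the actual PRIDE submission schema.

```python
import json

# Illustrative metadata record for a deposited proteomics dataset.
# Field names and values are examples only, not a real PRIDE schema.
metadata = {
    "dataset_id": "EXAMPLE-0001",   # hypothetical accession
    "species": "Homo sapiens",
    "tissue": "plasma",             # the annotation Martens' group needed for tissue prediction
    "instrument": "Q Exactive HF",  # the instrument actually used, not a drop-down default
    "sample_prep": "trypsin digestion, iodoacetamide alkylation",
    "contact": "lab@example.org",
}

# A machine-readable record like this is what makes a dataset reusable years later.
print(json.dumps(metadata, indent=2))
```

A few minutes spent filling in fields like these is what decides whether a dataset can be found and reused long after its paper stops being read.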

Shouldn’t you be more of a computer expert than a chemist by now?

‘That is a very dangerous statement. We are never going to replace chemists and biochemists; you need the people with the golden hands. But we can help them. We can point out that someone’s favourite protein carries a modification that looks interesting; they can then decide whether it is worth spending a year of their life on it. It has to become a partnership.’ 

Machine learning and deep learning 

Artificial intelligence (AI) is an umbrella term for software that attempts to imitate human intelligence on some level and, at best, can reason independently. Machine learning is a subset of this: algorithms search for hidden patterns in existing data sets that serve as ‘learning material’, and based on those patterns they predict what the next action will lead to. Lennart Martens’ group in Ghent works with deep learning, an advanced form of machine learning in which the algorithms have a layered structure. Each layer, usually built as a neural network, elaborates further on the output of the previous layer. This leads to an unprecedented level of detail.
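The ‘layered structure’ described above can be illustrated with a minimal forward pass, in which each layer transforms the output of the one before it. The weights below are arbitrary numbers chosen for illustration; a trained deep-learning model would learn them from data.

```python
import math

def layer(inputs, weights, biases):
    """One neural-network layer: weighted sums passed through a tanh activation."""
    return [math.tanh(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Two stacked layers with arbitrary, untrained weights (illustration only).
w1 = [[0.5, -0.2], [0.1, 0.8]]   # first layer: 2 inputs -> 2 neurons
b1 = [0.0, 0.1]
w2 = [[1.0, -1.0]]               # second layer: 2 inputs -> 1 neuron
b2 = [0.2]

x = [0.3, 0.7]                   # input features
hidden = layer(x, w1, b1)        # first layer processes the raw input
output = layer(hidden, w2, b2)   # second layer elaborates on the first layer's output
```

Stacking many such layers, each refining the previous one’s representation, is what gives deep learning the level of detail the box describes.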