Although examples of active small proteins have been known for decades, large-scale genomics and metagenomics studies have traditionally considered only proteins above a minimum length. The major difficulty in detecting small proteins is that naïve adaptations of standard gene prediction algorithms produce a very large number of false positives.
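As a rough illustration of the false-positive problem, the sketch below counts open reading frames (ORFs) in random DNA, which by construction contains no genes, so every ORF found is spurious. It is a toy model, not how real gene predictors work: the function name, the forward-strand-only scan, the ATG-only start codon, the sequence length, and the length thresholds are all assumptions chosen for the example.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}

def count_orfs(seq, min_codons):
    """Count ATG-to-stop ORFs on the forward strand (3 frames).

    An ORF is counted if it has at least `min_codons` codons,
    including the start codon and excluding the stop codon.
    """
    count = 0
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    count += 1
                start = None
    return count

# Random DNA has no real genes, so every ORF here is a false positive.
rng = random.Random(0)
random_dna = "".join(rng.choice("ACGT") for _ in range(200_000))

for min_codons in (100, 30, 10):
    print(min_codons, count_orfs(random_dna, min_codons))
```

Lowering the minimum length can only increase the number of spurious ORFs, which is why a length cutoff alone cannot separate real small proteins from noise.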
We have been working to overcome this problem by combining very large datasets with machine learning. At that scale, we can focus on sequences that are evolutionarily conserved and highly prevalent in the environment.
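One simple way to picture a prevalence filter is to keep only candidate sequences that recur across many independent samples. The sketch below is a minimal illustration under stated assumptions, not our actual pipeline: the function name, the input layout (sample identifier mapped to predicted peptide sequences), the exact-match comparison, and the threshold are all hypothetical.

```python
from collections import Counter

def prevalent_peptides(samples, min_samples):
    """Return peptides predicted in at least `min_samples` samples.

    `samples` maps a sample identifier to an iterable of peptide
    sequences (hypothetical layout). Sequences are compared by exact
    match, which is a simplification made for this example.
    """
    counts = Counter()
    for peptides in samples.values():
        # Use a set so that each sample contributes at most once per peptide.
        counts.update(set(peptides))
    return {pep for pep, n in counts.items() if n >= min_samples}

# Made-up example data: three samples with short placeholder peptides.
example = {
    "sample_A": ["MKKLLA", "MRRWF", "MKKLLA"],
    "sample_B": ["MKKLLA", "MGGS"],
    "sample_C": ["MKKLLA", "MRRWF"],
}
kept = prevalent_peptides(example, min_samples=2)
print(sorted(kept))
```

A spurious prediction is unlikely to recur with the same sequence in many unrelated samples, so requiring recurrence trades some sensitivity for far fewer false positives.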
One class of small proteins of particular interest is antimicrobial peptides (AMPs). As the name suggests, these small proteins inhibit the growth of, or outright kill, other microorganisms. They are of interest for several reasons, including their potential as novel antibiotics, their role in shaping microbial communities, and their potential as a source of novel biotechnological products. We have developed a tool, Macrel, that predicts active AMPs with high precision. We have used Macrel to build the AMPSphere, a catalog of 1 million putative AMPs from >60k metagenomes and 80k genomes. This work has yielded not only novel sequences but also insights into microprotein evolution and ecology. In vitro testing revealed that the majority of the tested sequences are active against a range of pathogens, a result we further validated in a preclinical model of infection.