Determining the taxonomy and relative abundance of microorganisms in metagenomic data is a foundational problem in microbial ecology. To address the limitations of existing approaches, we developed ‘SingleM’, which estimates community composition using conserved regions within universal marker genes. SingleM accurately profiles complex communities of known microbial species, and is the only tool that detects species without genomic representation, even those representing novel phyla. Given SingleM’s computational efficiency, we applied it to 248,559 publicly available metagenomes; the resulting community profiles are available in the online database ‘Sandpiper’ (https://sandpiper.qut.edu.au/). The vast majority of samples from marine, freshwater, sediment and soil environments are dominated by novel species lacking genomic representation (median relative abundance 75.0%). Quantifying the full diversity of Bacteria and Archaea in metagenomic data shows that microbial genome databases are far from saturated.
SingleM has several further applications. It can identify metagenomes containing lineages of interest, enabling the targeted recovery of novel metagenome-assembled genomes from underrepresented phyla. Accurate quantification of novel lineages also allows us to estimate the number of reads in a metagenome that are microbial. Soil metagenomes contain mostly microbial reads, but many animal metagenomes are dominated by eukaryotic reads.
Natural selection is a massively parallel set of experiments. Community profiles from across Earth’s ecosystems are the results, showing us each species’ physicochemical growth range. We show that optimal growth temperature can be predicted from biogeographical observations. Large-scale estimation of microbial growth conditions may help predict how microbial species will react to, and exacerbate, climate change.