Horizontal gene transfer (HGT) is a critical driver in the evolution and diversification of many microbial species. The patterns of gene gain and loss offer insights into the role of selection in the evolution of bacterial pangenomes and how bacteria adapt to new niches. These dynamics have significant implications for the development of antibiotic resistance and the design of vaccine and drug interventions targeting bacterial pathogens.
Methods that analyse patterns of gene presence or absence frequently overlook errors introduced by the automated annotation and clustering of gene sequences. In particular, techniques adapted from ecological studies can be misleading, as they may reflect the temporal diversity of the sampled genomes rather than actual variation in the dynamics of HGT.
To tackle these challenges, we have developed several algorithms, including Panaroo and Panstripe, that are robust against population structure, sampling bias, and errors in the predicted presence or absence of genes. Through simulations, we demonstrate that these algorithms can effectively identify differences in the rate and size of HGT events. Moreover, we show that the choice of algorithm used to define a 'core' genome in bacteria can significantly influence the subsequent reconstructions of bacterial phylogenies.
We illustrate the capability of these algorithms by analysing a diverse set of bacterial genome datasets representing major human pathogens, including Escherichia coli, Klebsiella pneumoniae and Mycobacterium tuberculosis. Our findings underscore the importance and benefits of accounting for annotation errors and population structure when analysing prokaryotic pangenomes.