Whether pathogenic bacteria should be classified by their phylogenetic relationships, which represent their evolutionary history, or by their phenotypic traits, which directly reflect their pathogenicity, is an ongoing debate among microbiologists and clinicians. Phylogenetic relationships have historically been assessed using conserved, vertically transmitted genomic elements, e.g. the 16S ribosomal RNA gene (1). Indeed, the prokaryotic tree of life in the Genome Taxonomy Database (GTDB) has been generated from concatenated alignments of 120 and 53 conserved single-copy marker genes for the bacterial and archaeal domains, respectively (2). However, even though these lineages can be classified in a systematic manner, the resulting species definitions do not always map onto meaningful groupings for applied microbiology, because a subset of their phenotypic traits (e.g. virulence factors that confer pathogenicity) are often transmitted horizontally (3, 4). As such, these two classification approaches often do not align.
The Bacillus cereus group is a species complex of many closely related lineages, which have been associated with respiratory illness, diarrhoea, emesis, meningitis, food spoilage, and insecticidal activity, making the group relevant across multiple industries (5, 6). However, the taxonomic ambiguity among these lineages has made resolving the group a challenge.
Recent studies have demonstrated that these lineages can be differentiated by the presence of mobile genetic elements that are horizontally transferred (6, 7). For example, the key virulence factors of B. anthracis, the anthrax toxin and the capsule protein, are encoded on its plasmids pXO1 and pXO2, respectively, and the virulence of emetic B. cereus is associated with the cereulide gene cluster carried by the plasmid pCER270 (5, 8).
We present a model that uses presence/absence information for these virulence factors, together with other genomic features extracted from whole genome sequences, such as k-mer composition and gene annotations. We believe this model will prevent the misclassifications of B. cereus group members that arise when relying on the GTDB. We assessed the accuracy of the classifier as a function of genome completeness and contamination, because incomplete genomes may be missing key virulence factors, a limitation that affects the robustness of traditional genome-based detection methods.
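To illustrate the kind of input such a model might take, the following minimal Python sketch combines k-mer composition with virulence factor presence/absence flags into a single feature vector, and crudely mimics an incomplete genome by truncating the sequence. It is not the classifier described here: the marker names, the function names, the choice of k, and the truncation-based completeness model are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

# Hypothetical marker labels, used for illustration only.
VIRULENCE_MARKERS = ("anthrax_toxin", "capsule", "cereulide")


def kmer_frequencies(sequence, k=2):
    """Relative frequencies of all 4**k DNA k-mers, in lexicographic order."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing ambiguous bases (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def feature_vector(sequence, detected_markers, k=2):
    """Concatenate k-mer frequencies with presence/absence flags (1.0 or 0.0)."""
    presence = [1.0 if m in detected_markers else 0.0 for m in VIRULENCE_MARKERS]
    return kmer_frequencies(sequence, k) + presence


def simulate_incompleteness(sequence, completeness):
    """Keep only the leading fraction of the sequence.

    This is a crude stand-in for an incomplete assembly (assumption).
    """
    keep = int(len(sequence) * completeness)
    return sequence[:keep]


# Usage: features for a toy sequence at full and at 50% completeness.
toy_genome = "ACGTACGTTTGACCA"
full = feature_vector(toy_genome, {"cereulide"})
partial = feature_vector(simulate_incompleteness(toy_genome, 0.5), set())
```

In this toy setup, the truncated genome loses its marker flag as well as part of its k-mer signal, which is the situation the completeness analysis is meant to probe.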