bacLIFE: a user-friendly framework for large-scale comparative genomics, lifestyle prediction, and identification of lifestyle-associated genes
bacLIFE was built and written in Python and R, is organized using a Snakemake workflow manager37, and is freely available as open-source software (https://github.com/Carrion-lab/bacLIFE). bacLIFE takes full and draft genome sequences in fasta format as input to automatically generate clusters of genes and absence/presence matrices of these clusters in the genomes, which constitute the basis for its predictions. The pipeline consists of three modules (Fig. 1). The clustering module (i) uses Markov clustering (MCL)38 in combination with linclust from the MMseqs2 tools39 (Supplementary Fig. 1), enabling the generation of a database of functional gene families (see Methods). The MCL method, based on all vs all BLAST alignment statistics, is known as an accurate algorithm to cluster proteins by function40. In addition, bacLIFE introduces a novel approach by integrating antiSMASH41 and BiG-SCAPE42 to also generate absence/presence matrices at the BGC level. The lifestyle prediction module (ii) employs a random forest machine learning model to forecast bacterial lifestyle or other specified metadata from these matrices (Supplementary Fig. 2). In the analytical module (iii) the results from the above-described modules are embedded in a Shiny user-friendly interface for comprehensive and interactive comparative genomics (Supplementary Fig. 3). In this interface, bacLIFE enables the exploration, visualization, and downstream analysis by Principal Coordinates Analysis (PCoA), dendrograms, pan-core-genome analyses and prediction of lifestyle-associated genes (pLAGs). A pLAG is referred to as a gene or gene cluster that exhibits a distinct pattern of presence for a specific (annotated) lifestyle while being largely absent in other lifestyles. The novelty and strength of bacLIFE lies in the combination and integration of diverse analytical tools in a logical, automated, and interactive framework for comparative genomics and prediction of new LAGs. For more detailed information, please see the documentation at https://github.com/Carrion-lab/bacLIFE.
Lifestyle prediction of Burkholderia/Paraburkholderia and Pseudomonas species
Burkholderia/Paraburkholderia and Pseudomonas genera (4611 and 12,235 genomes, respectively) were selected as case studies to evaluate the accuracy and predicted LAGs of bacLIFE due to the ample knowledge regarding their habitats and lifestyles (Supplementary Datasets 1). The computational time and speed of the bacLIFE framework was reduced by clustering all genomes at 99% Average Nucleotide Identity (ANI) similarity, resulting in 644, 200, and 2050 genomes of Burkholderia, Paraburkholderia and Pseudomonas, respectively (Supplementary Datasets 1). Thus, we mitigate the impact of genome redundancy present in multiclonal genomes, potentially influencing statistical analyses. Based on literature, three main lifestyles were defined in both datasets: environmental (e.g., Paraburkholderia spp., P. fluorescens), opportunistic animal pathogens (e.g., B. cepacia complex, B. pseudomallei complex, P. aeruginosa and stutzeri) and plant pathogens (e.g., B. plantarii, glumae, gladioli, P. syringae and viridiflava) (Supplementary Datasets 2). A significant proportion of genomes in both datasets remained unassigned to any lifestyles due to the absence of species-level designations in NCBI, which may compromise statistical power in the further steps of the bacLIFE pipeline.
To predict the lifestyle of non-annotated genomes, we trained and tested a random forest machine learning model with the gene cluster absence/presence matrices output from the bacLIFE clustering module. The Pseudomonas lifestyle prediction accuracy, represented by the sensitivity and specificity in ROC plots using five-fold cross-validation, showed that the Area Under the Curve (AUC)43 (Supplementary Fig. 4) was in the range of 0.96–0.99 for each lifestyle prediction (opportunistic animal pathogen, plant pathogen, environmental). These results are in line with the AUC values (0.985) reported for plant-associated Pseudomonas16. Regarding the Burkholderia/Paraburkholderia dataset, the random forest algorithm demonstrated exceptional performance, achieving perfect classification with an Area Under the Curve (AUC) value of 1.00 for each lifestyle (Supplementary Fig. 4).
To assess the random forest algorithm’s performance in predicting the lifestyle of potential novel species with no representative in databases, we employed a leave one species out validation. In the Burkholderia/Paraburkholderia dataset, we trained the random forest model without including any B. vietnamiensis genome, which is a known opportunistic human pathogen. Using this model, we predicted the lifestyle of B. vietnamiensis genomes and found that all 40 genomes were classified as opportunistic human pathogens with 90% of the trees voting for the same class. Using the same approach, the random forest algorithm correctly classified 92/124 P. fluorescens genomes as environmental strains and 30/35 P. cichorii genomes as plant pathogens, with a minimum number of trees voting for the class of 70% for both predictions. The lower performance of P. fluorescens in this analysis can be explained due to the reclassification of some genomes previously labeled as ‘fluorescens’, now designated as other species. This can be observed in genomes such as P. fluorescens W-6 (Supplementary Datasets 2), where the classifier value for an environmental lifestyle is 0.65 while most of fluorescens genomes exhibit values of more than 0.9.
The application of machine learning enabled us to assign lifestyles to these genomes by leveraging metadata from well-studied species/strains. Therefore we predicted the formerly unknown lifestyles of 137 out of 145 (94.5%) and 740 out of 976 (75.9%) Burkholderia/Paraburkholderia and Pseudomonas strains, respectively (Fig. 2 and Supplementary Datasets 2). Using bacLIFE’s clustering output from the Burkholderia/Paraburkholderia and Pseudomonas datasets and the augmented metadata file containing the predicted lifestyles (Supplementary Datasets 2), Principal Coordinate Analysis (PCoA) was used to visualize the lifestyles distribution (Fig. 2). For Burkholderia/Paraburkholderia, members of the Paraburkholderia genus, renowned for their beneficial effects on plants, formed a cohesive cluster (Supplementary Fig. 5a). On the other hand, plant pathogens like B. gladioli, glumae, and plantarii formed a smaller cluster situated between the Paraburkholderia cluster and the opportunistic animal pathogen group. Within the opportunistic animal pathogen group, we observed more variability attributed to various species and bacterial complexes such as the B. cepacia and B. pseudomallei complex. For Pseudomonas, well-known plant pathogens such as P. syringae, savastanoi and viridiflava among others grouped together and apart from the animal pathogenic P. stutzeri, mendocina, and aeruginosa (Supplementary Fig. 5b). Environmental Pseudomonas species and strains were found distributed widely, with the majority forming a larger cluster, containing plant-beneficial P. fluorescens and protegens, the plant pathogens P. gingeri and P. palleroniana and the animal pathogens P. monteilii and juntendi. These findings provide additional evidence of the adaptive ability of Pseudomonas species to shift between lifestyles depending on the environmental conditions44. In summary, the bacLIFE pipeline substantially improved the lifestyle prediction of poorly classified Burkholderia/Paraburkholderia and Pseudomonas species and strains using machine learning algorithms.
When examining the distribution of lifestyles across the phylogeny derived from both datasets, we observed a strong correlation between lifestyle and phylogeny (Fig. 3 and Supplementary Fig. 6). In the Burkholderia/Paraburkholderia dataset (Fig. 3), all plant pathogenic species cluster within the same phylogenetic clade. Similarly, in the Pseudomonas dataset (Supplementary Fig. 6), 80% of plant pathogens belong to the same clade, corresponding to species from the syringae complex. However, approximately 20% of plant pathogens, including species like P. gingeri and P. corrugata, exhibited distinct phylogenetic delineation in the tree. This high interconnection between lifestyle and phylogeny in both datasets poses a challenge in distinguishing between genes associated with a genuine lifestyle impact and those associated primarily with phylogeny.
Identification of general and lifestyle-associated functions
The relative abundance of general functions among the different lifestyles was calculated for Burkholderia/Paraburkholderia and Pseudomonas using the absence/presence gene cluster matrices from the bacLIFE clustering module with the COG-aggregated45 functional categories (Fig. 3, Supplementary Figs. 6 and 7). COG categories RNA processing and modification (A) and Cytoskeleton (Z) were removed from the analysis because they are mainly based in Eukaryotes. Statistical analysis (Kruskal–Wallis p < 0.05, Supplementary Datasets 3) showed that most COG categories were different among groups, suggesting the presence of specific LAGs (Supplementary Fig. 7). Clade-specific differences within a lifestyle were also observed, for example in the B. ubonensis group, for which a higher relative abundance of genes involved in Defense mechanisms (V), Cell motility (N), and Extracellular structure (W) was observed relative to other species with the same lifestyle (Fig. 3). A higher relative abundance of genes involved in Carbohydrate transport and metabolism (G) was observed in environmental and phytopathogenic Burkholderia/Paraburkholderia and Pseudomonas strains as compared to strains with an opportunistic pathogen lifestyle (Fig. 3 and Supplementary Figs. 6 and 7). These results confirm and extend earlier results by Levy et al.6. Inorganic ion transport and metabolism (P) has been reported as more abundant in Pseudomonas associated with humans, according to Saati-Santamaria et al.46. This trait is also observed to be more prevalent in opportunistic animal-associated bacteria within both the Pseudomonas and Burkholderia genera.
We observed a significant increase in the COG category Intracellular trafficking, secretion, and vesicular transport (U). Genes related to secretion systems and vesicles, which play a role in the virulence of the plant pathogen Pseudomonas47, were found within this category (Supplementary Fig. 7b). To further validate the distribution of bacLIFE’s gene cluster among the genomes, given the increased prevalence of COG category U in Pseudomonas plant pathogens, we examined the established distribution patterns of genes related to the Type III secretion system (T3SS) and its effectors. This comparison aimed to assess whether bacLIFE’s results align with the findings in the existing literature. The T3SS structural genes are divided in two groups, hrc genes that are highly conserved among Pseudomonads and hrp genes that are mainly present in plant pathogenic Pseudomonas such as P. syringae48. Using BLAST49, we mapped the hrc and hrp genes (10 and 16 genes, respectively) to the bacLIFE gene clusters, revealing that hrc genes were widely spread among all members of the Pseudomonas genus, whereas the hrp genes were mostly exclusively present in plant pathogenic P. syringae (Fig. 4). Interestingly, we observed that several Pseudomonas species/strains, such as four P. gingeri and three P. mendocina strains, not belonging to the P. syringae complex contained the hrc/hrp genes. These results suggest the importance of horizontal gene transfer in the evolution and dissemination of this well-known virulence trait50. Furthermore, we analyzed the distribution of 26 well-studied Type III secretion system (T3SS) effectors51 (Supplementary Fig. 8). These effectors were divided into two groups: ‘prevalent’ effectors, including AvrE, HopB, HopM, and HopAA, which are known to be present in most P. syringae species, and ‘sparse’ effectors, which are only found in a few P. syringae strains. Our results consistently confirmed the presence of the four prevalent effectors in the majority of P. syringae strains. Additionally, some prevalent effectors, such as AvrE and HopB families, were also detected in other members of the P. syringae complex, including P. viridiflava and P. cichorii. The remaining effectors showed a sparse distribution exclusively within the P. syringae complex, aligning with findings reported earlier in the literature51.
Prediction of lifestyle-associated genes in Burkholderia and Pseudomonas
A Fisher’s exact test in combination with a threshold of 70% group prevalence and >2 log2fold change between group presence is implemented in the analytical module and used to predict specific pLAGs. In Burkholderia/Paraburkholderia, this yielded 382, 478, and 786 pLAGs for the predicted environmental, opportunistic/animal pathogen and plant pathogen lifestyles respectively (Fig. 5a). In Pseudomonas, 48, 74, 377 pLAGs were found for the environmental, opportunistic/animal pathogen and plant pathogen lifestyles, respectively (Fig. 5b). The greater presence of pLAGs in Burkholderia/Paraburkholderia might be attributed to a stronger correlation between lifestyle and phylogeny when compared to Pseudomonas. Tools like phyloGLM52 and treeWAS53 were created to consider phylogeny when exploring gene associations. For plant pathogenic pLAGs (see methods section for more detailed information), both tools exhibited substantial overlap with the threshold used in bacLIFE. PhyloGLM, being notably more restrictive, identified fewer associations, while treeWAS, with a more relaxed approach, detected more associations compared to bacLIFE.
Both datasets demonstrated that the plant pathogen lifestyle exhibited the highest number of pLAGs, which aligns with the close clustering observed among genomes from the species/strains with this lifestyle in the PCoA analysis (Fig. 2a, b). In every lifestyle, a high proportion (~30%) of these pLAGs code for proteins with unknown functions in the COG database (Fig. 5a, b), underscoring the larger unexplored genomic information contained within the lifestyle-associated genes in comparison with the core-genome51. However, pLAGs constitute a portion of the accessory genome that is better understood in terms of functionality when compared to unique genes or genes found in only a single bacterium.
Known LAGs predicted by bacLIFE
To functionally validate the pLAGs predicted by bacLIFE, we focused on the phytopathogenic lifestyle of both taxa. We found 786 and 377 phytopathogenic pLAGs in Burkholderia/Paraburkholderia and Pseudomonas, respectively. We characterized pLAG-enriched regions as instances where five or more pLAGs were aligned in the same transcriptional direction in B. plantarii ATCC 43733 for Burkholderia and in P. syringae B728a for Pseudomonas. First, we looked into virulence mechanisms among these genomic regions and other pLAGs that are already described in literature for both plant pathogenic genera. In Burkholderia, we found four genomic regions enriched in pLAGs associated with type II, III, and VI secretion system proteins54 (Supplementary Datasets 4). Interestingly, in some of these genomic regions, we also found pLAGs annotated as phage genes next to secretion system proteins. This information sheds light on the hypothetical viral origin of the region and the utilization of phage-like lysozymes in secretion systems for plant pathogenic Burkholderia55. In addition, bacLIFE detected genes for the two synthases ToxC and ToxD, responsible for the production of the virulence factor toxoflavin56, which are exclusive to the genomes of B. glumae and B. gladioli species (Supplementary Datasets 4). Lastly, we found multiple pLAGs for N-acyl-homoserine lactones (AHLs) (Supplementary Datasets 4). Quorum sensing, mediated by AHLs, plays a pivotal role in establishing and maintaining successful symbiotic relationships between legumes and nitrogen-fixing rhizobia57. Other functions of AHLs are also in pathogenicity, for example for B. gladioli pv. agaricicola infection of the mushroom-forming fungus Agaricus spp.58.
For Pseudomonas, bacLIFE predicted a group of consecutive genes associated with the type II secretion system and its effectors59 (Supplementary Datasets 5). Additionally, bacLIFE detected a group of more than 10 consecutive genes involved in tellurite resistance (Supplementary Datasets 5) in Pseudomonas plant pathogens. Tellurite resistance is known to be present in plant pathogenic P. syringae60,61 and confers resistance to copper. Copper-based chemicals have been widely employed to combat bacterial diseases, resulting in the emergence of copper resistance mechanisms61,62. Previous studies have demonstrated that host plants utilize copper ions as a defense strategy against bacterial infections, activating immune responses through defense signaling pathways63,64.
We compared plant pathogenic pLAGs in Burkholderia and Pseudomonas. Using BLAST, we searched for bidirectional best hits between plant pathogenic pLAGs of Pseudomonas and Burkholderia, finding 13 shared pLAGs out of 786 in Burkholderia and 377 in Pseudomonas. This suggests a minimal 1–3% overlap between plant pathogen pLAGs of these two genera. Among these 13 genes, we find several sugar transporters (Supplementary Datasets 6). Sugar transporters play a crucial role in plant pathogenic bacteria as they are essential for these microbes to obtain sugars from their plant hosts65. These transporters, which are common among plant pathogens in two different genera, are likely responsible for the uptake of specialized sugars from plants, enabling these bacteria to thrive in the hostile environment they encounter.
Lastly, we performed a BLAST49 analysis on the 786 phytopathogenic pLAGs in Burkholderia/Paraburkholderia and the 377 phytopathogenic pLAGs in Pseudomonas. These were compared against the PHI-base, a database containing genes experimentally validated for their roles in pathogen-host interactions35. By selecting the representative sequence from all plant pathogen pLAGs gene clusters in Burkholderia, we conducted BLAST searches against the PHI-base, resulting in 253 out of 786 significant hits (e-value < 1e-2), which accounts for 16.4% of the queried pLAGs (Supplementary Datasets 4). In contrast, when we used the pan-genome as a representative sequence for all gene clusters, the BLAST analysis yielded 4788 hits out of a total of 157,696 aligned sequences, representing only 3% of the total. Similarly, in the case of Pseudomonas, 142 out of the 377 plant pathogen pLAGs had significant hits (Supplementary Datasets 5), while 4693 hits were obtained out of 191,300 gene clusters (2.45%) comprising the pan-genome. This comparison underscores the enrichment of virulence factors within the plant pathogen LAGs of Burkholderia and Pseudomonas (α = 1e-5, Fisher’s exact test). While we have a higher proportion of hits within our pLAGs, the majority of these hits were observed in the pan-genome. This phenomenon can be attributed to the filtering criteria we applied, specifically the requirement of >70% group presence and >2 log2fold change to define a pLAG. It is worth noting that genes excluded by this filter may still play roles in interactions with the host. These genes could include those present in only a few bacteria and are pathovar-specific, as well as genes found in most bacteria but are involved in host interactions only in certain strains. Consequently, these genes may not satisfy the rigorous pLAG criteria, highlighting a limitation of bacLIFE, which primarily operates based on gene distributions.
Discovery of new LAGs using bacLIFE
The high proportion (~30%) of pLAGs in plant pathogenic bacteria with unknown functions (Fig. 5 and Supplementary Datasets 4, 5) provided a basis for the discovery of novel virulence factors in both genera. To validate the importance of these novel pLAGs in virulence, we selected 13 pLAGs associated with the phytopathogenic lifestyle in Burkholderia. The bacLIFE Shiny app allows users to generate a table of the detected pLAGs in a specific genome of the dataset. This table includes the genome position of the pLAGs and allows to find regions enriched in pLAGs. We prioritized these regions to find potential operons but we wanted to also test the role of pLAGs that were in non-pLAG-enriched genomic regions. Therefore, we classified these pLAGs as either pLAGs in genomic regions enriched in pLAGs (consecutive) or in individual pLAGs (solo) based on whether five or more pLAGs were positioned in the same transcriptional direction in the B. plantarii ATCC 43733 genome (Table 1 and Supplementary Fig. 9). The pLAG-enriched regions chosen for mutagenesis were verified as pLAG-enriched regions in other Burkholderia plant pathogenic species by examining the genomes of B. gladioli ATCC 25417 and B. glumae AU6208.
The selected pLAGs coded for hypothetical proteins, transcriptional regulators or transmembrane proteins and were widespread and highly conserved in the plant pathogenic Burkholderia strains as per the LAGs definition (Fig. 6A and Supplementary Fig. 10). B. plantarii DSM 9509 was used as a model for experimental validation of the selected pLAGs (Supplementary Fig. 11). Site-directed mutants were generated for each of the 13 selected pLAGs and in vitro experiments showed that 5 out of the 13 generated mutants were not affected in growth as compared to the wild-type strain (Supplementary Fig. 12); these were selected for in vivo virulence assays with rice seedlings (Fig. 6b and Supplementary Fig. 13). The results showed that virulence of mutants in all the 5 pLAGs was severely reduced in rice seedlings as compared to the disease symptoms caused by the wild-type strain confirming their status as true LAGs (Fig. 6c and Supplementary Fig. 13).
The genes encoded in the mutants ΩLAG14 and ΩLAG16 were part of a genomic region of 8 genes associated with plant pathogens (Supplementary Fig. 9). LAG14 is a putative extracellular solute-binding protein with a PF00497 sensor domain, which recognizes nicotinate, quidalnate, pyridine-2,5-dicarboxylate, and salicylate66. In Bordetella pertussis, interaction of nicotinate with the sensor kinase modulates the expression of certain virulence factors67. Therefore, we postulate that LAG14 may not be a virulence factor itself but affects regulation of other virulence factors in Burkholderia. LAG16 is a glycosyltransferase (GT), which can take part in the synthesis of polysaccharides, crucial components of the cell wall and biofilms68. A previous study showed that a knockout of a single glycosyltransferase resulted in a significant reduction in virulence in Xanthomonas citri, highlighting the potential role of GTs in virulence69. LAG22 is a small protein annotated as alpha/beta hydrolase with an undescribed DUF4180 domain with homologs present in other Gram-positive bacteria and encoded in a genomic region composed of five genes associated with plant pathogens (Supplementary Fig. 9). To date, however, none of these proteins have been functionally characterized. On the other hand, LAG11 encodes a hypothetical protein without any identified domains and is likely co-expressed in an operon with a gene encoding an M14 metallopeptidase. None of the adjacent genes were classified as LAGs (Supplementary Fig. 9).
Last, LAG23 encodes a homoserine dehydrogenase, an enzyme involved in conversion of L-aspartate-4-semialdehyde to L-homoserine. This is a critical step in the aspartate pathway for the biosynthesis of lysine, methionine, threonine, and isoleucine, and is involved in cellular functions such as cell wall biosynthesis, translation, and carbon metabolism70. The aspartate pathway is present exclusively in plants and microorganisms, not in mammals, making homoserine dehydrogenase an attractive candidate for development of new pesticides or antibiotics with minimal effect in animals71. Previous work has shown that a null deletion of homoserine dehydrogenase gene in Mycobacterium tuberculosis H37Rv had a bactericidal effect, showing that inhibition of the aspartate pathway leads to elimination of chronic infections72. Similar to LAG11, none of the genes present in the genomic context of LAG23 were classified as LAGs (Supplementary Fig. 9).
To address the potential influence of phylogeny on pLAG selection, we examined the distribution and correlation of pLAGs chosen for mutagenesis and experimental validation (Fig. 6a). We found that three out of five LAGs compromised in virulence on rice seedlings (LAGs 14, 22, and 23) were entirely correlated with the phylogeny and present in all plant pathogenic Burkholderia genomes in the dataset. This observation raises an important point: excluding genes highly correlated with phylogeny may result in missing putative LAGs. However, it may also increase the number of predicted LAGs linked to phylogenetic noise. We compared our LAG selection method with other methods that take phylogeny into account such as phyloGLM and treeWAS (see methods). Regarding to the 8 pLAGs with reduced growth, we decided to not subject these mutants to plant experiments, as their reduction in virulence may be correlated with reduced bacterial growth given the interconnected relationship between colonization and pathogenicity. While there could be variations in colonization within plant tissues or among different plant species, the practical challenges associated with testing multiple species make such investigations beyond the scope of this study.
In summary, with our experiments we were able to validate the role of 5 new true LAGs identified by bacLIFE for their role in virulence of Burkholderia. bacLIFE was more efficient in the identification of genes involved in the virulence process when LAGs were present in LAG-enriched regions, 3 out of the 5 (LAG14, LAG16 and LAG22) were present in two genomic regions enriched in other LAGs, while the 2 other LAGs (LAG11, LAG23) were solos (Table 1 and Supplementary Fig. 9). Exploring genomic regions enriched in LAGs as a potential filtering approach holds promise in further refining LAG selection and identifying genuine lifestyle associations and could help in differentiate real LAGs from phylogenetic noise.
Integrated secondary metabolite analysis in bacLIFE reveals known and unknown Biosynthetic Gene Clusters linked to a phytopathogenic lifestyle
Through the integration of antiSMASH41 and BiG-SCAPE42 in bacLIFE, we were able to study secondary metabolism of Burkholderia/Paraburkholderia and Pseudomonas. The antiSMASH results showed that clusters of BGCs or BiG-SCAPE Gene Cluster Families (GCFs) were mainly distributed and conserved among bacterial species or clades with conserved lifestyles (Supplementary Figs. 14, 15, and 16). In Burkholderia/Paraburkholderia, the secondary metabolite repertoire was composed of a high genomic abundance of genes encoding terpene, bacteriocin, and homoserine lactone synthetases58,73,74 (Supplementary Fig. 17a). Within Pseudomonas, the most abundant class of BGCs were Non-Ribosomal Peptide Synthetase (NRPS) families, consistent with previous studies75,76,77 (Supplementary Fig. 17a). NRPSs play a crucial role in the biosynthesis of important compounds in pseudomonads, including siderophores, lipopeptides, antibiotics, and virulence factors, contributing to their survival and pathogenicity75,76,77. When examining the average number of BGCs per genome across different lifestyles (Supplementary Fig. 17b, c), we observed a trend towards a higher number of NRPSs (Student’s t-test α = 0.01) in plant pathogen genomes of both Burkholderia/Paraburkholderia and Pseudomonas. While this association is well-documented in Pseudomonas, limited information is available on the role of NRPSs in plant pathogenic Burkholderia78.
Using the GCFs absence/presence matrix generated by the bacLIFE clustering module and applying the Fisher’s exact test (P < 0.01 and >15% group presence), we were able to find GCFs statistically significantly associated with a plant pathogenic lifestyle in Burkholderia/Paraburkholderia and Pseudomonas, in a similar fashion as described above for the LAGs. In Burkholderia (Supplementary Datasets 7), we found multiple phytopathogenic BGC families associated with terpene and homoserine lactone synthesis, well-known virulence factors in Burkholderia gladioli58. In Pseudomonas (Supplementary Datasets 8), we found BGCs likely encoding the production of the well-studied syringofactin and syringomycin, two NRPS-encoded lipopeptides that are critical for pathogenicity79,80 as well as two BGC families with an NRPS gene showing 41% and 40% similarity to the NRPS gene of the fragin BGC of Burkholderia cenocepacia H111 according to antiSMASH. Fragin is known as a metallophore with antifungal and antibacterial activity81. In addition, bacLIFE analysis resulted in the discovery of an unidentified and plant pathogen-associated Non-Ribosomal Peptide Synthetase – Polyketide Synthase (NRPS-PKS). NRPS-PKS hybrid compounds are relatively scarce in Pseudomonas (Supplementary Fig. 16b). However, a subset of these compounds, including syringolin, pyoluteorin, and coronatine, are recognized for their diverse functionalities, such as inducing chlorosis, acting as fungicides, or suppressing plant defense mechanisms82.
The identified hybrid gene cluster was found in various plant pathogens, including P. syringae and P. viridiflava, among other Pseudomonas species (Fig. 7a, b, and Supplementary Datasets 8). A phylogenetic tree of Pseudomonas strains annotated as plant pathogen, combined with a BLAST search (e-value = 0.01) and structure comparison based on antiSMASH outputs, demonstrated that this yet unknown NRPS-PKS gene follows a similar distribution to the BGCs of well-known virulence factors syringofactin and syringomycin BGC families within the P. syringae species while being highly conserved also in P. viridiflava (Fig. 7b). The genes present in this NRPS-PKS cluster were highly conserved, while the flanking regions were variable (Fig. 7c). Based on antiSMASH prediction, the NRPS-PKS encodes a peptide composed of the amino acids Arginine-Proline-Cysteine-X-X-Proline, although the compound and its structural composition remain yet unknown (Fig. 7d). NRPS-PKS typically produce a diverse group of natural products with complex chemical structures and pharmaceutical potential. Through comparative sequence analysis, we identified a significant match for the longest core biosynthetic gene and a flanking gene with amino-acid sequence identity of 49% and 58%, respectively, to the WP_004571777.1 and WP_004571774.1 proteins encoded in the rimosamide biosynthetic gene cluster of Streptomyces rimosus subsp. rimosus (BGC0001760), as retrieved from the MIBiG83 database. Rimosamides and detoxins are rare secondary metabolites that exhibit protective anti-antibiotic activities. These compounds are synthesized on modular NRPS-PKS enzyme complexes following a conserved thiotemplate mechanism84. To study the functional role of this novel BGC in virulence, a mutant in the first core biosynthetic gene was generated in P. syringae pv. phaseolicola 1448A (Pph ΩNRPS) and tested in in vivo experiments with bean. Mutant Pph ΩNRPS strain was not affected in growth (Supplementary Fig. 12) but showed reduced virulence (halo blight) as compared to the wild-type strain at 4 and 7 days post-inoculation (dpi) (Fig. 7e). During the infection process in the tissue, water-soaked spots are produced on leaves, which starts with a chlorotic process characterized by the development of yellowing areas. Next, damages quickly expand to necrosis, which produces the formation of darkened lesions, causing damage to the plant tissue. Image analysis of infected leaves was performed to determine the percentage of necrotic and chlorotic lesions per infiltrated area (Fig. 7f and Supplementary Fig. 18). Necrotic lesions of the tissue induced by wild-type Pph 1448A covered approximately 15.57 ± 2.61% and 68.36 ± 13.66% of total infiltrated area at 4 and 7 dpi, respectively. In contrast, leaves inoculated with Pph ΩNRPS exhibited significantly fewer necrotic lesions than those caused by the wild-type strain, accounting for approximately 4.06 ± 2.07% and 19.55 ± 6.53% of the infiltrated area at 4 and 7 dpi, respectively (Fig. 7f). At 7 dpi, however, chlorosis of the tissue induced by Pph ΩNRPS mutant was significantly higher compared to the wild-type strain (31.17 ± 5.16% vs 19.58 ± 7.8%) (Supplementary Fig. 18). These results suggest a delay in symptom development upon leaf inoculation with Pph ΩNRPS. To further test the implications of this novel NRPS-PKS in virulence of Pph 1448A, competition assays between the wild-type strain and Pph ΩNRPS mutant were performed (Fig. 7g). The competitive index (CI) of the mutant relative to the wild-type strain showed that the wild-type strain was able to outcompete the mutant strain in planta, revealing that the wild-type strain had a competitive advantage over the mutant strain, as evidenced by the calculated CI values of 0.5. (Fig. 7g). Altogether these results show the involvement of this NRPS-PKS in the fine-tuned process of infection of Pph 1448A and that it is required for competitive growth and survival on bean leaves. We postulate that this newly identified NRPS-PKS gene cluster, which shows high conservation among various plant pathogenic Pseudomonas, codes for the production of specific compound(s) that either enhance pathogenicity or protect the pathogen against plant defense metabolites. To test this hypothesis, large-scale isolation and purification, structural identification, extensive bioassays, and mode-of-action experiments will be required.
In summary, the comprehensive computational workflow presented here as bacLIFE enabled large-scale comparative genomics, prediction of lifestyles, and genes associated with the predicted lifestyle. Additionally, the interactive and intuitive user interface we developed enables comprehensive investigation and visualization of these advanced outputs. In this study, we have shown the potential of bacLIFE to predict bacterial lifestyle using machine learning algorithms trained with genetic signature distributions. We also demonstrated the accuracy of bacLIFE in the prediction of known and unknown LAGs, several of which were validated experimentally as true LAGs. Using B. plantarii DSM 9509 as a model, site-directed mutagenesis and plant assays revealed 5 selected LAGs involved in virulence. Importantly, bacLIFE was more efficient in LAG prediction in B. plantarii when these were present in regions enriched in LAGs. bacLIFE also led to the discovery of a novel NRPS-PKS gene cluster in plant pathogenic Pseudomonas species. Subsequent mutagenesis of this gene cluster in P. syringae pv. phaseolicola 1448A showed a reduced virulence in bean, reinforcing the potential of bacLIFE as a valuable tool for the discovery of new virulence-associated genes. While the current bacLIFE pipeline is tailored for prokaryotic genomes, the framework can be extended to eukaryotic organisms, in particular fungi and yeasts. bacLIFE will not only advance our understanding of the functions and evolution of genes associated with a specific lifestyle, but also provide a valuable resource for risk assessment studies of specific microorganisms or synthetic microbial communities targeted for agronomic or environmental applications.