<p align="center">
<img src="SpoMAG_title.png" alt="SpoMAG_title" width="600"/>
</p>

<p align="center">
<img src="SpoMAGlogo.png" alt="SpoMAG_logo" width="600"/>
</p>

## Scope


SpoMAG is an R-based machine learning tool developed to predict the sporulation potential of Metagenome-Assembled Genomes (MAGs) from uncultivated Firmicutes species, particularly from the Bacilli and Clostridia classes. 

SpoMAG leverages the complex combination of presence or absence of sporulation-associated genes to infer whether a genome is capable of undergoing sporulation, even in the absence of cultivation or complete genome assemblies. This strategy allows researchers to assess sporulation potential using only functional annotations from metagenomic data.


## Workflow overview


SpoMAG predicts the sporulation potential of a given genome through a three-step workflow: 

- `sporulation_gene_name()`, which parses a functional annotation table, such as those generated by eggNOG-mapper, and identifies genes related to sporulation using curated gene names and KEGG orthologs. It requires as input a data frame with the columns `Preferred_name`, `KEGG_ko`, and `genome_ID` (named exactly as such). The function outputs a filtered table of annotations containing sporulation-related genes.


- `build_binary_matrix()`, which converts the filtered annotations into a binary matrix, with rows representing genomes and columns representing genes (1 = present, 0 = absent). Missing genes are automatically filled with zeros to ensure consistent input for machine learning models.


- `predict_sporulation()`, which applies a pre-trained ensemble model combining Random Forest and Support Vector Machine predictions into a stacked meta-classifier. It outputs the classification label (`Sporulating` or `Non_sporulating`), model probabilities, and final ensemble probability value.

<p align="center">
<img src="SpoMAG_steps.png" alt="SpoMAG_steps" width="900"/>
</p>


SpoMAG abstracts the complexity of machine learning, allowing users to simply provide an annotation table and receive interpretable predictions. The tool is designed to be accessible even to those without prior expertise in bioinformatics or machine learning.

Its ensemble learning approach combines the predictions from Random Forest and Support Vector Machine classifiers, trained on high-quality labeled datasets of known spore-formers and non-spore-formers. The predictions are then used as features in a meta-classifier using model stacking, enhancing prediction accuracy and allowing SpoMAG to capture complementary decision boundaries from each model.
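
To make the stacking idea concrete, here is a minimal base-R sketch on simulated data. It is not SpoMAG's training code. The base-model probabilities (`rf_prob`, `svm_prob`) are simulated rather than produced by real Random Forest or SVM models, and logistic regression (`glm`) is an assumed stand-in for the meta-classifier, whose type is not stated in this README.

```r
set.seed(42)

# Simulated labels: 1 = spore-former, 0 = non-spore-former (toy data)
n <- 200
label <- rbinom(n, size = 1, prob = 0.5)

# Simulated base-model probabilities, noisy but correlated with the label.
# In a real stack these would be out-of-fold predictions from each base model.
clamp <- function(x) pmin(pmax(x, 0.01), 0.99)
rf_prob  <- clamp(0.5 + 0.25 * (2 * label - 1) + rnorm(n, sd = 0.20))
svm_prob <- clamp(0.5 + 0.20 * (2 * label - 1) + rnorm(n, sd = 0.25))

stack_df <- data.frame(label, rf_prob, svm_prob)

# Meta-classifier (assumed): logistic regression on the two base probabilities
meta_fit <- glm(label ~ rf_prob + svm_prob, data = stack_df, family = binomial())

# Ensemble probability and final class label for each toy genome
meta_prob <- predict(meta_fit, newdata = stack_df, type = "response")
meta_pred <- ifelse(meta_prob >= 0.5, "Sporulating", "Non_sporulating")

# Cross-tabulate the true labels against the stacked predictions
print(table(truth = label, prediction = meta_pred))
```

The meta-classifier only sees the two probability columns, which is what lets it weight each base model according to how informative it is.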

Model performance was evaluated using cross-validation and standard metrics, including AUC-ROC, F1-score, accuracy, precision, specificity, and recall. In these evaluations, SpoMAG showed high sensitivity and specificity across MAGs recovered from the microbiota of different hosts.
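
For readers less familiar with these metrics, the sketch below computes the label-based ones in base R. The labels and predictions are invented for illustration and are not SpoMAG results. AUC-ROC is left out because it needs ranked probabilities rather than hard labels.

```r
# Toy ground truth and predictions (made-up values)
truth <- c(rep("Sporulating", 5), rep("Non_sporulating", 5))
pred  <- c("Sporulating", "Sporulating", "Sporulating", "Sporulating", "Non_sporulating",
           "Non_sporulating", "Non_sporulating", "Non_sporulating", "Sporulating", "Sporulating")

positive <- "Sporulating"
tp <- sum(truth == positive & pred == positive)  # true positives
fn <- sum(truth == positive & pred != positive)  # false negatives
tn <- sum(truth != positive & pred != positive)  # true negatives
fp <- sum(truth != positive & pred == positive)  # false positives

metrics <- c(
  accuracy    = (tp + tn) / (tp + tn + fp + fn),
  precision   = tp / (tp + fp),
  recall      = tp / (tp + fn),   # also called sensitivity
  specificity = tn / (tn + fp),
  f1          = 2 * tp / (2 * tp + fp + fn)
)
print(round(metrics, 3))
```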

Whether analyzing hundreds of MAGs or a single novel lineage, SpoMAG offers a robust, automated, and cultivation-independent solution to assess sporulation potential. No phenotypic validation or manual annotation is required, making it a practical tool for exploring the ecological and functional roles of spore-forming bacteria.

The SpoMAG repository is hosted on GitHub at https://github.com/labinfo-lncc/SpoMAG, where you can report bugs and get help.



## Citation

Paper under publication.



## Installation of the SpoMAG package

You can install the **SpoMAG** package directly from GitHub using:

```r
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install SpoMAG from GitHub
devtools::install_github("labinfo-lncc/SpoMAG")
```

### Dependencies

SpoMAG depends on the following packages:

- dplyr, version 1.1.4
- tidyr, version 1.3.1
- tibble, version 3.2.1
- readr, version 2.1.5
- caret, version 7.0.1
- randomForest, version 4.7.1.2
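
If you want to check which of the packages listed above are already available before installing SpoMAG, a small base-R sketch such as the following can help. It only tests for presence, not for the specific versions, and the variable names are illustrative.

```r
# Packages listed under "Dependencies"
deps <- c("dplyr", "tidyr", "tibble", "readr", "caret", "randomForest")

# requireNamespace() returns FALSE when a package is absent
# and does not attach anything to the search path
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- deps[!installed]
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them from CRAN
} else {
  message("All listed dependencies are installed.")
}
```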


### Preprocessing functions in SpoMAG

### 1. `sporulation_gene_name()`
It extracts sporulation-related genes from an annotation dataframe by searching for gene names and KEGG orthologs.
- Input: A dataframe with at least `Preferred_name`, `KEGG_ko` and `genome_ID` columns.
  
- Output: A filtered dataframe containing sporulation-related hits, each annotated with the standardized columns `spo_gene_name` and `spo_process`.
  

```r
genes <- sporulation_gene_name(df)
```

### 2. `build_binary_matrix()`
It creates a binary matrix indicating the presence (1) or absence (0) of known sporulation genes in each genome.
- Input: A dataframe output from the `sporulation_gene_name()` function.
  
- Output: A wide-format dataframe (genome in rows, genes in columns).
   

```r
matrix <- build_binary_matrix(genes)
```

Note: The function automatically fills in missing genes with 0 to ensure consistent input for sporulation-capacity prediction.
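
The zero-filling behaviour can be illustrated with a base-R sketch. This is not the package's implementation: `reference_genes` is a hypothetical three-gene list (with `sigE` added only to show an absent gene), whereas the real reference set is defined inside SpoMAG.

```r
# Toy filtered annotations, shaped like the output of sporulation_gene_name()
genes <- data.frame(
  genome_ID     = c("G001", "G001", "G001", "G002"),
  spo_gene_name = c("spo0A", "spoIIIE", "spo0A", "spoIIIE"),
  stringsAsFactors = FALSE
)

# Hypothetical reference gene set (assumption for this example)
reference_genes <- c("spo0A", "spoIIIE", "sigE")

# Cross-tabulate genomes by gene. Fixing the factor levels to the reference
# set gives genes that were never observed (here sigE) a column of zeros.
counts <- table(
  genes$genome_ID,
  factor(genes$spo_gene_name, levels = reference_genes)
)

# Collapse counts to presence/absence: duplicated hits still count as 1
presence <- unclass(counts) > 0
storage.mode(presence) <- "integer"

binary <- data.frame(genome_ID = rownames(presence), presence,
                     row.names = NULL, check.names = FALSE)
print(binary)
```

Because every genome ends up with the same columns in the same order, the resulting table can be passed to a model that expects a fixed feature set.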

### Function to predict sporulation using SpoMAG

### 3. `predict_sporulation()`
It applies a pre-trained ensemble machine learning model to predict the sporulation potential of genomes based on the binary matrix of genes.

- Input:
  - `binary_matrix`: Output from `build_binary_matrix()`

- Output: A dataframe with:
  - `genome_ID`: the genome ID you are using as input
  - `RF_Prob`: Random Forest probability of being a spore-former
  - `SVM_Prob`: Support Vector Machine probability of being a spore-former
  - `Meta_Prob_Sporulating`: Ensemble probability of being a spore-former
  - `Meta_Prediction`: Final prediction (`Sporulating` or `Non_sporulating`)


```r
results <- predict_sporulation(binary_matrix = matrix)
```
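
Once you have a results table, ordinary data-frame operations are enough to summarise it. The sketch below uses a toy table with the documented columns; every value in it is made up, and the 0.5 cut-off used to compare the two base models is an assumption for this example only.

```r
# Toy results with the columns described above (all values are invented)
results <- data.frame(
  genome_ID             = c("MAG_01", "MAG_02", "MAG_03"),
  RF_Prob               = c(0.93, 0.08, 0.55),
  SVM_Prob              = c(0.88, 0.12, 0.47),
  Meta_Prob_Sporulating = c(0.95, 0.05, 0.52),
  Meta_Prediction       = c("Sporulating", "Non_sporulating", "Sporulating"),
  stringsAsFactors = FALSE
)

# Count genomes per predicted class
print(table(results$Meta_Prediction))

# Keep only predicted spore-formers, most confident first
spore_formers <- results[results$Meta_Prediction == "Sporulating", ]
spore_formers <- spore_formers[order(-spore_formers$Meta_Prob_Sporulating), ]

# Flag genomes where the two base models fall on opposite sides of 0.5
results$models_disagree <- (results$RF_Prob >= 0.5) != (results$SVM_Prob >= 0.5)
print(results[results$models_disagree, "genome_ID"])
```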

### Input data format

To use SpoMAG, your input must be a functional annotation table, such as the output from eggNOG-mapper, containing at least three columns:

| genome_ID | Preferred_name | KEGG_ko |
|-----------|----------------|---------|
| G001      | spoIIIE        | K03466  |
| G001      | spo0A          | K07699  |
| G001      | -              | K01056  |
| G001      | pth            | -       |
| ...       | ...            | ...     |


- genome_ID: a name you have chosen for the genome you are working on
- Preferred_name: the predicted name of the gene
- KEGG_ko: the KEGG Orthology code (e.g., K07699)

Each row should represent one gene annotation.
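
Since the column names must match exactly, it can be worth checking a table before running the workflow. The helper below, `check_annotation_table()`, is a hypothetical convenience function written for this example and is not part of SpoMAG.

```r
# Toy annotation table with the three required columns
df <- data.frame(
  genome_ID      = c("G001", "G001", "G001", "G001"),
  Preferred_name = c("spoIIIE", "spo0A", "-", "pth"),
  KEGG_ko        = c("K03466", "K07699", "K01056", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stop with a clear message if a required column is absent
check_annotation_table <- function(df) {
  required <- c("genome_ID", "Preferred_name", "KEGG_ko")
  missing_cols <- setdiff(required, colnames(df))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

check_annotation_table(df)

# A misnamed column (wrong capitalisation) is caught by the same check
bad <- df
colnames(bad)[colnames(bad) == "KEGG_ko"] <- "kegg_ko"
print(tryCatch(check_annotation_table(bad), error = function(e) conditionMessage(e)))
```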

Another distinguishing feature of SpoMAG is its ability to infer gene presence even when annotations are ambiguous. As shown in the example above, some rows contain a valid `KEGG_ko` code but a missing or undefined `Preferred_name` (e.g., "-"), while others have a predicted gene name but no associated KO. SpoMAG integrates both fields to assign a unified `spo_gene_name`:

- If `Preferred_name` is missing but `KEGG_ko` matches a known sporulation-associated KO, the gene is identified based on the KO.

- If `KEGG_ko` is missing but `Preferred_name` matches a known sporulation gene, the gene is identified based on the name.

- If both are informative and match known references, preference is given to `Preferred_name`.
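
The three rules above can be sketched in base R as follows. This illustrates the logic only and is not the package's code: `resolve_gene()` is a hypothetical helper, and the lookup tables are toy versions built from the example table, whereas SpoMAG's curated lists are internal to the package.

```r
# Toy lookup tables derived from the example table (assumption for this sketch)
known_names <- c("spoIIIE", "spo0A", "pth")
ko_to_name  <- c(K03466 = "spoIIIE", K07699 = "spo0A", K01056 = "pth")

# Hypothetical helper implementing the three rules
resolve_gene <- function(preferred_name, kegg_ko) {
  # Match by gene name; NA when the name is "-" or not in the list
  by_name <- ifelse(preferred_name %in% known_names, preferred_name, NA_character_)
  # Match by KO; NA when the KO is "-" or not in the lookup
  by_ko <- unname(ko_to_name[kegg_ko])
  # Preferred_name wins when both fields match
  ifelse(!is.na(by_name), by_name, by_ko)
}

df <- data.frame(
  Preferred_name = c("spoIIIE", "spo0A", "-", "pth", "-"),
  KEGG_ko        = c("K03466", "K07699", "K01056", "-", "-"),
  stringsAsFactors = FALSE
)
df$spo_gene_name <- resolve_gene(df$Preferred_name, df$KEGG_ko)
print(df)
```

In this toy table the third row is resolved through its KO, the fourth through its name, and the last row, which has neither, stays `NA`.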

## Quick start
### Running an example with a single genome in the annotation file
This is a quick example using the included files: `one_sporulating.csv` (a known spore-former) and `one_asporogenic.csv` (a known non-spore-former).
The genome used for the spore-former here is the following:

- GCF_000007625.1 (_Clostridium tetani_ E88, Clostridia class)


| genome_ID       | Preferred_name | KEGG_ko |
|-----------------|----------------|---------|
| GCF_000007625.1 | spoIIIE        | K03466  |
| GCF_000007625.1 | spo0A          | K07699  |
| GCF_000007625.1 | -              | K01056  |
| GCF_000007625.1 | pth            | -       |
| ...             | ...            | ...     |


  
The genome used for the non-spore-former here is the following:

- GCF_000006785.2 (_Streptococcus pyogenes_ M1 GAS SF370, Bacilli class)


| genome_ID       | Preferred_name | KEGG_ko |
|-----------------|----------------|---------|
| GCF_000006785.2 | spo0A          | K07699  |
| GCF_000006785.2 | -              | K01056  |
| GCF_000006785.2 | pth            | -       |
| ...             | ...            | ...     |



```r
# Load package
library(SpoMAG)

# Load example annotation tables
file_spor <- system.file("extdata", "one_sporulating.csv", package = "SpoMAG")
file_aspo <- system.file("extdata", "one_asporogenic.csv", package = "SpoMAG")

# Read files
df_spor <- readr::read_csv(file_spor, show_col_types = FALSE)
df_aspo <- readr::read_csv(file_aspo, show_col_types = FALSE)

# Step 1: Extract sporulation-related genes
genes_spor <- sporulation_gene_name(df_spor)
genes_aspo <- sporulation_gene_name(df_aspo)

# Step 2: Convert to binary matrix
bin_spor <- build_binary_matrix(genes_spor)
bin_aspo <- build_binary_matrix(genes_aspo)

# Step 3: Predict using ensemble model (preloaded in package)

result_spor <- predict_sporulation(bin_spor)
result_aspo <- predict_sporulation(bin_aspo)

# View results
print(result_spor)
print(result_aspo)
```


### Running an example with more than one genome in the annotation file
This is a quick example using the included files: `ten_sporulating.csv` (ten known spore-formers) and `ten_asporogenic.csv` (ten known non-spore-formers).
The genomes used for the spore-formers here are the following:

- GCF_000011145.1 (_Bacillus halodurans_ C-125, Bacilli class)
- GCF_000008425.1 (_Bacillus licheniformis_ DSM 13 ATCC 14580, Bacilli class)
- GCF_000009045.1 (_Bacillus subtilis subsp. subtilis_ str. 168, Bacilli class)
- GCF_000338755.1 (_Bacillus thuringiensis serovar kurstaki_ HD73, Bacilli class)
- GCF_000010165.1 (_Brevibacillus brevis_ NBRC 100599, Bacilli class)
- GCF_000009205.2 (_Clostridioides difficile_ 630, Clostridia class)
- GCF_000008765.1 (_Clostridium acetobutylicum_ ATCC 824, Clostridia class)
- GCF_000063585.1 (_Clostridium botulinum_ A ATCC 3502, Clostridia class)
- GCF_000013285.1 (_Clostridium perfringens_ ATCC 13124, Clostridia class)
- GCF_000007625.1 (_Clostridium tetani_ E88, Clostridia class)

| genome_ID       | Preferred_name | KEGG_ko |
|-----------------|----------------|---------|
| GCF_000010165.1 | spoIIIE        | K03466  |
| GCF_000009205.2 | -              | K01056  |
| GCF_000009205.2 | pth            | -       |
| ...             | ...            | ...     |


The genomes used for the non-spore-formers here are the following:

- GCF_000011985.1 (_Lactobacillus acidophilus_ NCFM, Bacilli class)
- GCF_000338115.2 (_Lactobacillus plantarum_ ZJ316, Bacilli class)
- GCF_000016825.1 (_Lactobacillus reuteri_ DSM 20016, Bacilli class)
- GCF_000011045.1 (_Lactobacillus rhamnosus_ GG ATCC 53103, Bacilli class)
- GCF_000237995.1 (_Pediococcus claussenii_ ATCC BAA-344, Bacilli class)
- GCF_000145035.1 (_Butyrivibrio proteoclasticus_ B316, Clostridia class)
- GCF_000020605.1 (_Eubacterium rectale_ ATCC 33656, Clostridia class)
- GCF_900070325.1 (_Herbinix luporum_ SD1D, Clostridia class)
- GCF_003589745.1 (_Lachnoanaerobaculum umeaense_ DSM 23576, Clostridia class)
- GCF_000225345.1 (_Roseburia hominis_ A2-183, Clostridia class)

| genome_ID       | Preferred_name | KEGG_ko |
|-----------------|----------------|---------|
| GCF_000011985.1 | spoIIIE        | K03466  |
| GCF_000011985.1 | spo0A          | K07699  |
| GCF_000011045.1 | -              | K01056  |
| GCF_000011045.1 | pth            | -       |
| ...             | ...            | ...     |


```r
# Load package
library(SpoMAG)

# Load example annotation tables
file_spor <- system.file("extdata", "ten_sporulating.csv", package = "SpoMAG")
file_aspo <- system.file("extdata", "ten_asporogenic.csv", package = "SpoMAG")

# Read files
df_spor <- readr::read_csv(file_spor, show_col_types = FALSE)
df_aspo <- readr::read_csv(file_aspo, show_col_types = FALSE)

# Step 1: Extract sporulation-related genes
genes_spor <- sporulation_gene_name(df_spor)
genes_aspo <- sporulation_gene_name(df_aspo)

# Step 2: Convert to binary matrix
bin_spor <- build_binary_matrix(genes_spor)
bin_aspo <- build_binary_matrix(genes_aspo)

# Step 3: Predict using ensemble model (preloaded in package)
result_spor <- predict_sporulation(bin_spor)
result_aspo <- predict_sporulation(bin_aspo)

# View results
print(result_spor)
print(result_aspo)
```
