Output description

In general, nanomotif outputs are provided as tab-separated values (TSV) files. Each TSV file contains one or more columns detailing the results, predictions, and underlying data.

Motif Discovery

Overview:
The motif discovery step identifies DNA methylation motifs within bins and writes them to bin-motifs.tsv. This file contains information about the discovered motifs, their modification types, and their degree of methylation.

Files:

bin-motifs.tsv: Contains identified motifs

Columns in bin-motifs.tsv:

Column	Description
reference	The identifier of the contig or bin in which the motif was discovered.
mod_type	The type of base modification associated with the motif, given as a modification code (e.g., `a` for 6mA, `m` for 5mC, `21839` for 4mC).
motif	The detected DNA motif sequence. Uses IUPAC degenerate nucleotide codes to represent positions where multiple nucleotides are possible.
mod_position	The zero-based index of the modified base within the motif. For example, if `mod_position` is 0, the modification occurs on the first nucleotide of the motif.
n_mod	The count of motif occurrences considered “methylated” (fraction of mapped bases methylated, ≥0.7 by default).
n_nomod	The count of motif occurrences considered “unmethylated” (fraction of mapped bases methylated, <0.2 by default).
motif_type	Classification of the motif as palindromic, non-palindromic, or bipartite. Palindromic motifs are identical to their reverse complement, while bipartite motifs consist of two distinct parts separated by a gap.
motif_complement	The reverse complement of the motif. Only reported if the reverse complement is identified.
mod_position_complement	The zero-based position of the modified base in the motif complement.
n_mod_complement	The number of methylated occurrences of the complement motif.
n_nomod_complement	The number of unmethylated occurrences of the complement motif.
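
As an illustration of how these columns fit together, the following minimal sketch reads rows in the bin-motifs.tsv layout, looks up the modified base from the zero-based `mod_position`, and derives a methylation degree. The sample rows are invented, the `summarize` helper is hypothetical, and defining the degree as n_mod / (n_mod + n_nomod) is an assumption made for this example rather than a value nanomotif reports.

```python
import csv
import io

# Invented excerpt using a subset of the bin-motifs.tsv columns.
SAMPLE = (
    "reference\tmod_type\tmotif\tmod_position\tn_mod\tn_nomod\n"
    "bin1\ta\tGATC\t1\t950\t50\n"
    "bin1\tm\tCCWGG\t1\t300\t100\n"
)


def summarize(handle):
    """Return one summary dict per motif row (hypothetical helper)."""
    summaries = []
    for row in csv.DictReader(handle, delimiter="\t"):
        n_mod = int(row["n_mod"])
        n_nomod = int(row["n_nomod"])
        total = n_mod + n_nomod
        position = int(row["mod_position"])  # zero-based index into the motif
        summaries.append(
            {
                "reference": row["reference"],
                "motif": row["motif"],
                "mod_type": row["mod_type"],
                "modified_base": row["motif"][position],
                # Assumed definition: share of counted occurrences that are methylated.
                "methylation_degree": n_mod / total if total else None,
            }
        )
    return summaries


summary = summarize(io.StringIO(SAMPLE))
for entry in summary:
    print(entry["motif"], entry["modified_base"], entry["methylation_degree"])
```

To read a real file, the `io.StringIO(SAMPLE)` argument can be swapped for an open file handle on bin-motifs.tsv.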

Bin Improvement

Overview:
The bin improvement module refines genome bins by identifying contigs that may be contaminants. Suspect contigs are recorded in bin_contamination.tsv. If requested, cleaned bins (with contaminants removed) can be created.

Additionally, the include_contigs command assigns unbinned contigs to existing bins using classification models trained on methylation patterns, resulting in the include_contigs.tsv file.

Files:

bin_contamination.tsv: Lists contigs flagged as contamination.
include_contigs.tsv: Lists previously unbinned contigs and their new bin assignments.

bin_contamination.tsv

Contigs appear in this file if all four clustering methods (agg, spectral, gmm, hdbscan) assign the contig to a cluster different from the bin’s main cluster, suggesting contamination. Each contaminant has four rows, one per clustering algorithm, together with the cluster statistics. If the --write_bins flag is specified, new decontaminated bins are written to a bins folder.

Column	Description
contig	The identifier of the contig flagged as contamination.
bin	The name of the bin that the contig currently belongs to.
method	The clustering method used (agg, spectral, gmm, hdbscan).
cluster	The cluster ID assigned by the specified method. Each method produces its own numeric cluster IDs, so IDs differ between methods.
bin_cluster	The cluster ID associated with the bin’s original assignment.
bin_length	The total length (in base pairs) of all contigs currently in the bin.
n_contigs_bin	The total number of contigs currently assigned to the bin.
fraction_contigs	The fraction of the bin’s contigs grouped under `bin_cluster` (e.g., 10/12 ≈ 0.83 if 10 out of 12 contigs are in `bin_cluster`).
fraction_length	The fraction of the bin’s length represented by contigs in `bin_cluster`. Similar to fraction_contigs but based on sequence length.
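
The four-rows-per-contaminant layout can be consumed along the following lines. This is a rough sketch rather than nanomotif's own logic: the sample rows are invented and use only a subset of the columns, and the helper names `contaminants_by_bin` and `clean_bin` are hypothetical. It collects, per bin, the contigs for which all four methods report a cluster different from `bin_cluster`, then drops them from a list of contig names.

```python
import csv
import io
from collections import defaultdict

METHODS = {"agg", "spectral", "gmm", "hdbscan"}

# Invented excerpt with a subset of the bin_contamination.tsv columns.
SAMPLE = (
    "contig\tbin\tmethod\tcluster\tbin_cluster\n"
    "contig_7\tbin1\tagg\t2\t0\n"
    "contig_7\tbin1\tspectral\t1\t3\n"
    "contig_7\tbin1\tgmm\t4\t1\n"
    "contig_7\tbin1\thdbscan\t0\t2\n"
    "contig_9\tbin1\tagg\t2\t0\n"  # incomplete record, ignored below
)


def contaminants_by_bin(handle):
    """Map each bin to the contigs flagged by all four methods."""
    seen = defaultdict(set)  # (bin, contig) -> methods that flagged it
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["cluster"] != row["bin_cluster"]:
            seen[(row["bin"], row["contig"])].add(row["method"])
    flagged = defaultdict(set)
    for (bin_name, contig), methods in seen.items():
        if methods == METHODS:
            flagged[bin_name].add(contig)
    return dict(flagged)


def clean_bin(contigs, flagged):
    """Return the bin's contigs with flagged contaminants removed."""
    return [name for name in contigs if name not in flagged]


contaminants = contaminants_by_bin(io.StringIO(SAMPLE))
cleaned = clean_bin(["contig_1", "contig_2", "contig_7"], contaminants["bin1"])
print(contaminants, cleaned)
```

The completeness check on the method set is a defensive choice for this example; per the description above, every contig listed in the file is expected to have all four rows.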

include_contigs.tsv

Unbinned contigs are assigned to bins using three classifiers (Random Forest, Linear Discriminant Analysis, and k-Nearest Neighbors) trained on methylation data. If all three classifiers assign an unbinned contig to the same bin with a joint mean probability above 0.80, the contig is assigned to that bin; this is a high_confidence assignment. Assignments can also be of medium or low confidence: medium_confidence means all three classifiers agree but the joint mean probability is below 0.80, and low_confidence means only two classifiers agree. high_confidence assignments are written to a new_contig_bin.tsv file.

Column	Description
contig	Identifier of the previously unbinned contig.
bin	The original bin assignment. “unbinned” indicates no prior assignment.
assigned_bin	The bin predicted for this contig by the classifiers.
method	The classification method (knn, lda, rf). knn = k-Nearest Neighbors, lda = Linear Discriminant Analysis, rf = Random Forest.
prob	The probability assigned by the method for placing the contig in `assigned_bin`.
mean_prob	The average probability from all methods for the contig-bin pair.
confidence	Qualitative confidence (“high_confidence”, “medium_confidence”, “low_confidence”). High confidence predictions are more reliable.
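
The confidence rule described above can be sketched as a small function. This is an illustration under stated assumptions, not the code nanomotif runs: the `classify` helper is hypothetical, the mean is taken over the classifiers that voted for the winning bin, a mean of exactly 0.80 is treated as medium confidence, and the case where no two classifiers agree (not covered in the text) returns `None`.

```python
from collections import Counter


def classify(predictions, threshold=0.80):
    """Label one contig given {method: (assigned_bin, prob)} for knn, lda, rf.

    Returns (assigned_bin, mean_prob, confidence), or None when no two
    classifiers agree (an assumption; the text does not cover that case).
    """
    votes = Counter(bin_name for bin_name, _ in predictions.values())
    top_bin, n_votes = votes.most_common(1)[0]
    if n_votes < 2:
        return None
    # Assumed: average only the probabilities of the agreeing classifiers.
    probs = [prob for bin_name, prob in predictions.values() if bin_name == top_bin]
    mean_prob = sum(probs) / len(probs)
    if n_votes == len(predictions):
        label = "high_confidence" if mean_prob > threshold else "medium_confidence"
    else:
        label = "low_confidence"
    return top_bin, mean_prob, label


high = classify({"knn": ("bin1", 0.90), "lda": ("bin1", 0.85), "rf": ("bin1", 0.95)})
medium = classify({"knn": ("bin1", 0.70), "lda": ("bin1", 0.60), "rf": ("bin1", 0.80)})
low = classify({"knn": ("bin1", 0.90), "lda": ("bin2", 0.60), "rf": ("bin1", 0.70)})
none = classify({"knn": ("bin1", 0.5), "lda": ("bin2", 0.5), "rf": ("bin3", 0.5)})
print(high, medium, low, none)
```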

MTase-Linker

Overview:
The MTase-linker module attempts to link discovered motifs to predicted MTase genes. MTases are part of Restriction-Modification (RM) systems and recognize specific DNA motifs. By integrating motif discovery results with gene prediction and homology searches, MTase-linker outputs files listing MTase genes, their predicted motif specificity, and whether the discovered motifs can be confidently linked to them.

Files:

mtase_assignment_table.tsv: Lists all predicted MTase genes, their likely modification type and motif specificity, and whether a motif is linked.
nanomotif_assignment_table.tsv: Similar to bin-motifs.tsv, with added columns indicating whether a motif is linked to a specific MTase.

mtase_assignment_table.tsv

Column	Description
bin	The bin/genome in which the MTase gene is located.
gene_id	A unique identifier for each MTase gene, combining the contig ID and a gene number (e.g., “contig_12_5”).
contig	The contig on which the MTase gene is found.
mod_type_pred	The predicted modification type (e.g., ac for 6mA/4mC, m for 5mC).
sub_type	Predicted RM system subtype (I, II, IIG, III).
RM_system	TRUE/FALSE indicating if the MTase is part of a complete RM system.
motif_type_pred	Predicted motif type that the MTase may recognize (palindromic, non-palindromic, bipartite).
REbase_ID	Closest homolog MTase from REbase, if identity and coverage meet the threshold (≥80%).
motif_pred	Predicted recognition motif based on REbase annotation.
linked	TRUE/FALSE indicating if a motif could be unambiguously linked to the MTase gene. If TRUE, see `detected_motif`.
detected_motif	The exact motif sequence linked to this MTase, if `linked` is TRUE.
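
A common follow-up is to pull out only the MTase genes with a linked motif. The sketch below assumes the `linked` column holds the literal strings TRUE or FALSE as described above; the sample rows are invented, the empty `detected_motif` for unlinked genes is a guess at the file's convention, and `linked_mtases` is a hypothetical helper.

```python
import csv
import io

# Invented excerpt with a subset of the mtase_assignment_table.tsv columns.
SAMPLE = (
    "bin\tgene_id\tcontig\tmod_type_pred\tlinked\tdetected_motif\n"
    "bin1\tcontig_12_5\tcontig_12\tac\tTRUE\tGATC\n"
    "bin1\tcontig_3_40\tcontig_3\tm\tFALSE\t\n"
)


def linked_mtases(handle):
    """Map gene_id to detected_motif for rows where linked is TRUE."""
    return {
        row["gene_id"]: row["detected_motif"]
        for row in csv.DictReader(handle, delimiter="\t")
        if row["linked"].strip().upper() == "TRUE"
    }


links = linked_mtases(io.StringIO(SAMPLE))
print(links)
```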

nanomotif_assignment_table.tsv

This file includes data from bin-motifs.tsv plus two additional columns: linked and candidate_genes.

linked (TRUE/FALSE): Indicates if the motif can be unambiguously attributed to an MTase.
candidate_genes: Lists the MTase genes linked to the motif if linked = TRUE, or other potential candidates if not.

Additional Outputs From Dependencies

MTase-linker uses external tools for gene prediction and annotation:

DefenseFinder output: .../defensefinder/{assembly_name}_processed_defense_finder_systems.tsv lists RM-system gene annotations to verify if MTases form part of a complete RM-system.
MTase amino acid sequences: .../defensefinder/{assembly_name}_processed_defense_finder_mtase.faa contains predicted MTase amino acid sequences.
All predicted protein sequences: .../prodigal/{assembly_name}_processed.faa contains amino acid sequences for all predicted proteins, including MTases.