Output description

In general, nanomotif outputs are provided as tab-separated values (TSV) files. Each TSV file contains one or more columns detailing the results, predictions, and underlying data.


Motif Discovery

Overview:
The motif discovery step identifies DNA methylation motifs within bins. It outputs bin-motifs.tsv, which contains the discovered motifs, their modification types, and their degree of methylation.

Files:

  • bin-motifs.tsv: Contains identified motifs

Columns in bin-motifs.tsv:

  • reference: The identifier of the contig or bin in which the motif was discovered.

  • mod_type: The type of base modification associated with the motif, given as a single-letter code where one exists and a numeric code otherwise (e.g., a for 6mA, m for 5mC, 21839 for 4mC).

  • motif: The detected DNA motif sequence. Uses IUPAC degenerate nucleotide codes to represent positions where multiple nucleotides are possible.

  • mod_position: The zero-based index of the modified base within the motif. For example, if mod_position is 0, the modification occurs on the first nucleotide of the motif.

  • n_mod: The count of motif occurrences considered “methylated” (fraction of mapped bases methylated ≥0.7 by default).

  • n_nomod: The count of motif occurrences considered “unmethylated” (fraction of mapped bases methylated <0.2 by default).

  • motif_type: Classification of the motif as palindromic, non-palindromic, or bipartite. Palindromic motifs are identical to their own reverse complement, while bipartite motifs have a gap separating two distinct parts.

  • motif_complement: The reverse complement of the motif. Only reported if the reverse complement is identified.

  • mod_position_complement: The zero-based position of the modified base in the motif complement.

  • n_mod_complement: The number of methylated occurrences of the complement motif.

  • n_nomod_complement: The number of unmethylated occurrences of the complement motif.
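The column definitions above can be illustrated with a short Python sketch that reads bin-motifs.tsv-style rows, looks up the modified base from the zero-based mod_position, and checks whether a motif equals its own reverse complement. The helper names, the example rows, and the definition of the methylated fraction as n_mod / (n_mod + n_nomod) are assumptions made for illustration; they are not part of nanomotif. Occurrences that fall between the two default thresholds are counted in neither column, so this fraction only covers occurrences that were called one way or the other.

```python
import csv
import io

# IUPAC nucleotide codes mapped to their complements.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def reverse_complement(motif):
    """Reverse-complement a motif written with IUPAC codes."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(motif.upper()))


def is_palindromic(motif):
    """A DNA motif is palindromic if it equals its own reverse complement."""
    return motif.upper() == reverse_complement(motif)


def methylated_fraction(n_mod, n_nomod):
    """Assumed summary statistic: share of called occurrences that are methylated."""
    total = n_mod + n_nomod
    return n_mod / total if total else 0.0


def summarize(tsv_text):
    """Parse bin-motifs.tsv-style text and derive a few per-motif values."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    summary = []
    for row in reader:
        motif = row["motif"]
        position = int(row["mod_position"])  # zero-based index into the motif
        summary.append({
            "reference": row["reference"],
            "motif": motif,
            "modified_base": motif[position],
            "methylated_fraction": methylated_fraction(
                int(row["n_mod"]), int(row["n_nomod"])
            ),
            "palindromic": is_palindromic(motif),
        })
    return summary


# Hypothetical rows, for illustration only.
EXAMPLE = (
    "reference\tmod_type\tmotif\tmod_position\tn_mod\tn_nomod\n"
    "bin1\ta\tGATC\t1\t90\t10\n"
    "bin1\ta\tGAAGA\t4\t30\t70\n"
)

if __name__ == "__main__":
    for entry in summarize(EXAMPLE):
        print(entry)
```

To use this on a real file, pass the result of open("bin-motifs.tsv").read() to summarize instead of the example string.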


Bin Improvement

Overview:
The bin improvement module refines genome bins by identifying contigs that may be contaminants. Suspect contigs are recorded in bin_contamination.tsv. If requested, cleaned bins (with contaminants removed) can be created.

Additionally, the include_contigs command assigns unbinned contigs to existing bins using classification models trained on methylation patterns, resulting in the include_contigs.tsv file.

Files:

  • bin_contamination.tsv: Lists contigs flagged as contamination.

  • include_contigs.tsv: Lists previously unbinned contigs and their new bin assignments.

bin_contamination.tsv

Contigs appear in this file if all four clustering methods (agg, spectral, gmm, hdbscan) assign the contig to a cluster different from the bin’s main cluster, suggesting contamination. Each contaminant has four rows, one per clustering method, along with the cluster statistics. If the --write_bins flag is specified, new decontaminated bins are written to a bins folder.

Columns in bin_contamination.tsv:

  • contig: The identifier of the contig flagged as contamination.

  • bin: The name of the bin that the contig currently belongs to.

  • method: The clustering method used (agg, spectral, gmm, hdbscan).

  • cluster: The cluster ID assigned by the specified method. Methods produce different numeric cluster IDs.

  • bin_cluster: The cluster ID associated with the bin’s original assignment.

  • bin_length: The total length (in base pairs) of all contigs currently in the bin.

  • n_contigs_bin: The total number of contigs currently assigned to the bin.

  • fraction_contigs: The fraction of the bin’s contigs grouped under bin_cluster (e.g., 10/12 ≈ 0.83 if 10 out of 12 contigs are in bin_cluster).

  • fraction_length: The fraction of the bin’s length represented by contigs in bin_cluster. Similar to fraction_contigs but based on sequence length.
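As a rough illustration of the four-rows-per-contaminant layout, the sketch below groups bin_contamination.tsv-style rows by bin and contig and keeps only contigs that have a row from each of the four clustering methods with cluster different from bin_cluster. The function name and the example rows are hypothetical, and the row-level comparison of cluster against bin_cluster is an assumption about how the columns relate. A real file should already contain only fully flagged contigs, so this is a sanity check rather than a reimplementation of the contamination detection.

```python
import csv
import io
from collections import defaultdict

# The four clustering methods named in the file description.
EXPECTED_METHODS = {"agg", "spectral", "gmm", "hdbscan"}


def contaminants_by_bin(tsv_text):
    """Return {bin: [contigs]} for contigs flagged by every clustering method."""
    methods_seen = defaultdict(set)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Count a method only if it placed the contig outside the bin's cluster.
        if row["cluster"] != row["bin_cluster"]:
            methods_seen[(row["bin"], row["contig"])].add(row["method"])

    flagged = defaultdict(list)
    for (bin_name, contig), methods in methods_seen.items():
        if methods >= EXPECTED_METHODS:  # all four methods are present
            flagged[bin_name].append(contig)
    return {bin_name: sorted(contigs) for bin_name, contigs in flagged.items()}


# Hypothetical rows: contig_7 has all four methods, contig_9 only two.
_ROWS = [
    ("contig", "bin", "method", "cluster", "bin_cluster"),
    ("contig_7", "bin_a", "agg", "2", "1"),
    ("contig_7", "bin_a", "spectral", "0", "3"),
    ("contig_7", "bin_a", "gmm", "4", "1"),
    ("contig_7", "bin_a", "hdbscan", "-1", "0"),
    ("contig_9", "bin_a", "agg", "2", "1"),
    ("contig_9", "bin_a", "gmm", "4", "1"),
]
EXAMPLE_CONTAMINATION = "\n".join("\t".join(row) for row in _ROWS) + "\n"

if __name__ == "__main__":
    print(contaminants_by_bin(EXAMPLE_CONTAMINATION))
```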

include_contigs.tsv

Unbinned contigs are assigned to bins using three classifiers (Random Forest, Linear Discriminant Analysis, and k-Nearest Neighbors) trained on methylation data. If all three classifiers assign an unbinned contig to the same bin with a joint mean probability above 0.80, the contig is assigned to that bin; this is a high_confidence assignment. Assignments can also be medium_confidence, where all three classifiers agree but the joint mean probability is below 0.80, or low_confidence, where only two classifiers agree. high_confidence assignments are written to a new_contig_bin.tsv file.

Columns in include_contigs.tsv:

  • contig: Identifier of the previously unbinned contig.

  • bin: The original bin assignment. “unbinned” indicates no prior assignment.

  • assigned_bin: The bin predicted for this contig by the classifiers.

  • method: The classification method (knn, lda, rf). knn = k-Nearest Neighbors, lda = Linear Discriminant Analysis, rf = Random Forest.

  • prob: The probability assigned by the method for placing the contig in assigned_bin.

  • mean_prob: The average probability from all methods for the contig-bin pair.

  • confidence: Qualitative confidence (“high_confidence”, “medium_confidence”, “low_confidence”). High confidence predictions are more reliable.
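The confidence rule described for include_contigs.tsv can be sketched as a small Python function. This is only an illustration of the stated rule, not nanomotif's code: the function name and input shape are hypothetical, the mean is taken over the classifiers that voted for the winning bin, and a mean probability of exactly 0.80 is treated as medium_confidence because the text does not say how ties are handled.

```python
from collections import Counter
from statistics import mean


def classify_assignment(predictions, threshold=0.80):
    """Apply the three-classifier agreement rule to one contig.

    predictions maps a method name (rf, lda, knn) to a tuple of
    (assigned_bin, prob). Returns (assigned_bin, mean_prob, confidence),
    or (None, None, None) when no two classifiers agree.
    """
    votes = Counter(bin_name for bin_name, _ in predictions.values())
    top_bin, n_votes = votes.most_common(1)[0]
    if n_votes < 2:
        return None, None, None

    # Assumption: average only the probabilities of the agreeing classifiers.
    mean_prob = mean(
        prob for bin_name, prob in predictions.values() if bin_name == top_bin
    )
    if n_votes == len(predictions) == 3:
        if mean_prob > threshold:
            confidence = "high_confidence"
        else:
            confidence = "medium_confidence"
    else:
        confidence = "low_confidence"
    return top_bin, mean_prob, confidence


if __name__ == "__main__":
    # Hypothetical predictions for a single unbinned contig.
    example = {"rf": ("bin_3", 0.90), "lda": ("bin_3", 0.85), "knn": ("bin_3", 0.95)}
    print(classify_assignment(example))
```

Only results labelled high_confidence would correspond to the assignments written to new_contig_bin.tsv.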


MTase-Linker

Overview:
The MTase-linker module attempts to link discovered motifs to predicted MTase genes. MTases are often part of Restriction-Modification (RM) systems and recognize specific DNA motifs. By integrating motif discovery results with gene prediction and homology searches, MTase-linker outputs files listing MTase genes, their predicted motif specificity, and whether the discovered motifs can be confidently linked to them.

Files:

  • mtase_assignment_table.tsv: Lists all predicted MTase genes, their likely modification type and motif specificity, and whether a motif is linked.

  • nanomotif_assignment_table.tsv: Similar to bin-motifs.tsv, with added columns indicating whether a motif is linked to a specific MTase.

mtase_assignment_table.tsv

Columns in mtase_assignment_table.tsv:

  • bin: The bin/genome in which the MTase gene is located.

  • gene_id: A unique identifier for each MTase gene, combining the contig ID and a gene number (e.g., “contig_12_5”).

  • contig: The contig on which the MTase gene is found.

  • mod_type_pred: The predicted modification type (e.g., ac for 6mA/4mC, m for 5mC).

  • sub_type: Predicted RM system subtype (I, II, IIG, III).

  • RM_system: TRUE/FALSE indicating if the MTase is part of a complete RM system.

  • motif_type_pred: Predicted motif type that the MTase may recognize (palindromic, non-palindromic, bipartite).

  • REbase_ID: Closest homolog MTase from REbase, if identity and coverage meet the threshold (≥80%).

  • motif_pred: Predicted recognition motif based on REbase annotation.

  • linked: TRUE/FALSE indicating if a motif could be unambiguously linked to the MTase gene. If TRUE, see detected_motif.

  • detected_motif: The exact motif sequence linked to this MTase, if linked is TRUE.
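As a small usage sketch for mtase_assignment_table.tsv, the Python below lists the MTase genes whose linked column is TRUE and splits a gene_id into contig and gene number. It assumes the gene number is the final underscore-separated field, as in the “contig_12_5” example, and that TRUE/FALSE are stored as plain text; the function names and example rows are hypothetical.

```python
import csv
import io


def split_gene_id(gene_id):
    """Split a gene_id such as 'contig_12_5' into ('contig_12', 5).

    Assumes the gene number is the last underscore-separated field.
    """
    contig, gene_number = gene_id.rsplit("_", 1)
    return contig, int(gene_number)


def linked_mtases(tsv_text):
    """Return (bin, gene_id, detected_motif) for rows where linked is TRUE."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row["bin"], row["gene_id"], row["detected_motif"])
        for row in reader
        if row["linked"].strip().upper() == "TRUE"
    ]


# Hypothetical rows showing only a subset of the columns.
EXAMPLE_MTASE = (
    "bin\tgene_id\tcontig\tlinked\tdetected_motif\n"
    "bin_a\tcontig_12_5\tcontig_12\tTRUE\tGATC\n"
    "bin_a\tcontig_3_41\tcontig_3\tFALSE\t\n"
)

if __name__ == "__main__":
    for bin_name, gene_id, motif in linked_mtases(EXAMPLE_MTASE):
        print(bin_name, split_gene_id(gene_id), motif)
```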

nanomotif_assignment_table.tsv

This file includes data from bin-motifs.tsv plus two additional columns: linked and candidate_genes.

  • linked (TRUE/FALSE): Indicates if the motif can be unambiguously attributed to an MTase.

  • candidate_genes: Lists the MTase genes linked to the motif if linked = TRUE, or other potential candidates if not.


Additional Outputs From Dependencies

MTase-linker uses external tools for gene prediction and annotation:

  • DefenseFinder output: .../defensefinder/{assembly_name}_processed_defense_finder_systems.tsv lists RM-system gene annotations to verify if MTases form part of a complete RM-system.

  • MTase amino acid sequences: .../defensefinder/{assembly_name}_processed_defense_finder_mtase.faa contains predicted MTase amino acid sequences.

  • All predicted protein sequences: .../prodigal/{assembly_name}_processed.faa contains amino acid sequences for all predicted proteins, including MTases.