Output description
In general, nanomotif outputs are provided as tab-separated values (TSV) files. Each TSV file contains one or more columns detailing the results, predictions, and underlying data.
Motif Discovery
Overview:
The motif discovery step identifies DNA methylation motifs within bins. It outputs bin-motifs.tsv. This file contains information about the discovered motifs, their modification types, and their degree of methylation.
Files:
bin-motifs.tsv: Contains identified motifs
Columns in bin-motifs.tsv:
| Column | Description |
|---|---|
| reference | The identifier of the contig or bin in which the motif was discovered. |
| mod_type | The type of base modification associated with the motif. Uses single-letter codes (e.g., a for 6mA, m for 5mC). |
| motif | The detected DNA motif sequence. Uses IUPAC degenerate nucleotide codes to represent positions where multiple nucleotides are possible. |
| mod_position | The zero-based index of the modified base within the motif. For example, a value of 1 in the motif GATC points to the A. |
| n_mod | The count of motif occurrences considered “methylated” (fraction of mapped bases methylated, ≥0.7 by default). |
| n_nomod | The count of motif occurrences considered “unmethylated” (fraction of mapped bases methylated, <0.2 by default). |
| motif_type | Classification of the motif as palindromic, non-palindromic, or bipartite. Palindromic motifs are identical to their own reverse complement, while bipartite motifs have a gap separating two distinct parts. |
| motif_complement | The reverse complement of the motif. Only reported if the reverse complement is identified. |
| mod_position_complement | The zero-based position of the modified base in the motif complement. |
| n_mod_complement | The number of methylated occurrences of the complement motif. |
| n_nomod_complement | The number of unmethylated occurrences of the complement motif. |
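As an illustration of how these columns fit together, the following minimal sketch reads a bin-motifs.tsv-style table, looks up the modified base from mod_position, and checks whether a motif equals its own reverse complement. The example rows are invented, and the ratio n_mod / (n_mod + n_nomod) is only one plausible summary of the degree of methylation: it is an assumption of this sketch, not a value nanomotif reports, and it ignores occurrences that fall between the two default thresholds.

```python
import csv
import io

# Invented example rows: column names follow the table above,
# but every value here is made up for illustration.
EXAMPLE_TSV = (
    "reference\tmod_type\tmotif\tmod_position\tn_mod\tn_nomod\n"
    "bin1\ta\tGATC\t1\t950\t50\n"
    "bin1\tm\tCCWGG\t1\t300\t100\n"
    "bin2\ta\tCAGAG\t3\t80\t20\n"
)

# Complement of each IUPAC nucleotide code.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def reverse_complement(motif):
    """Reverse the motif and complement every IUPAC code."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(motif))


def summarize(handle):
    """Return one summary dict per motif row in a tab-separated handle."""
    summaries = []
    for row in csv.DictReader(handle, delimiter="\t"):
        motif = row["motif"]
        n_mod, n_nomod = int(row["n_mod"]), int(row["n_nomod"])
        total = n_mod + n_nomod
        summaries.append({
            "reference": row["reference"],
            "motif": motif,
            # mod_position is zero-based, so it indexes the motif directly.
            "modified_base": motif[int(row["mod_position"])],
            # Assumed summary statistic, not a nanomotif output column.
            "fraction_methylated": n_mod / total if total else None,
            "is_palindromic": motif == reverse_complement(motif),
        })
    return summaries


summaries = summarize(io.StringIO(EXAMPLE_TSV))
for summary in summaries:
    print(summary)
```

To use this on a real file, pass an open file handle (for example `open("bin-motifs.tsv", newline="")`) instead of the in-memory string.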
Bin Improvement
Overview:
The bin improvement module refines genome bins by identifying contigs that may be contaminants. Suspect contigs are recorded in bin_contamination.tsv. If requested, cleaned bins (with contaminants removed) can be created.
Additionally, the include_contigs command assigns unbinned contigs to existing bins using classification models trained on methylation patterns, resulting in the include_contigs.tsv file.
Files:
bin_contamination.tsv: Lists contigs flagged as contamination.
include_contigs.tsv: Lists previously unbinned contigs and their new bin assignments.
bin_contamination.tsv
Contigs appear in this file if all four clustering methods (agg, spectral, gmm, hdbscan) assign the contig to a cluster different from the bin’s main cluster, suggesting contamination. Each contaminant has four rows, one per clustering algorithm, along with the cluster statistics.
If the --write_bins flag is specified, new decontaminated bins are written to a bins folder.
| Column | Description |
|---|---|
| contig | The identifier of the contig flagged as contamination. |
| bin | The name of the bin that the contig currently belongs to. |
| method | The clustering method used (agg, spectral, gmm, hdbscan). |
| cluster | The cluster ID assigned by the specified method. Each method numbers its clusters independently, so IDs are not comparable across methods. |
| bin_cluster | The cluster ID associated with the bin’s original assignment. |
| bin_length | The total length (in base pairs) of all contigs currently in the bin. |
| n_contigs_bin | The total number of contigs currently assigned to the bin. |
| fraction_contigs | The fraction of the bin’s contigs grouped under bin_cluster. |
| fraction_length | The fraction of the bin’s length represented by contigs in bin_cluster. |
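Because each flagged contig occupies four rows, a common first step is to collapse the file to one entry per contig. The sketch below does this with the standard library, under the assumption that the file is a plain TSV with the columns listed above. The rows are invented, and the truncated contig_9 record is included deliberately to show the sanity check that all four methods are present.

```python
import csv
import io
from collections import defaultdict

METHODS = {"agg", "spectral", "gmm", "hdbscan"}

HEADER = ("contig", "bin", "method", "cluster", "bin_cluster",
          "bin_length", "n_contigs_bin", "fraction_contigs", "fraction_length")

# Invented example rows. contig_9 has only two rows on purpose, to show
# how an incomplete record is skipped by the sanity check below.
ROWS = [
    ("contig_7", "bin1", "agg", "2", "0", "2500000", "40", "0.90", "0.95"),
    ("contig_7", "bin1", "spectral", "1", "0", "2500000", "40", "0.90", "0.95"),
    ("contig_7", "bin1", "gmm", "3", "1", "2500000", "40", "0.88", "0.93"),
    ("contig_7", "bin1", "hdbscan", "4", "0", "2500000", "40", "0.85", "0.92"),
    ("contig_9", "bin2", "agg", "1", "0", "1800000", "25", "0.80", "0.90"),
    ("contig_9", "bin2", "gmm", "2", "0", "1800000", "25", "0.80", "0.90"),
]
EXAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def contaminants_by_bin(handle):
    """Map each bin to the contigs that all four methods place outside bin_cluster."""
    methods_seen = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        # Cluster IDs are compared only within a row, i.e. within one method.
        if row["cluster"] != row["bin_cluster"]:
            methods_seen[(row["bin"], row["contig"])].add(row["method"])

    flagged = defaultdict(list)
    for (bin_name, contig), methods in sorted(methods_seen.items()):
        if methods == METHODS:
            flagged[bin_name].append(contig)
    return dict(flagged)


flagged = contaminants_by_bin(io.StringIO(EXAMPLE_TSV))
print(flagged)
```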
include_contigs.tsv
Assigning unbinned contigs to bins uses classification methods (Random Forest, Linear Discriminant Analysis, and k-Nearest Neighbors) trained on methylation data.
If all three classifiers assign an unbinned contig to the same bin with a joint mean probability above 0.80, the contig is assigned to that bin. This is called a high_confidence assignment. Contigs can also receive medium_confidence and low_confidence assignments.
A medium_confidence assignment is one where all three classifiers agree but the joint mean probability is below 0.80. A low_confidence assignment is one where only two classifiers agree.
high_confidence assignments are written to a new_contig_bin.tsv file.
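The confidence rule described here can be sketched as a small function. This is an illustration of the stated rule, not nanomotif's implementation: the function name and input shape are hypothetical, and three details are assumptions because the text does not specify them (the mean is taken over the classifiers that agree, a mean of exactly 0.80 is treated as medium confidence, and a contig on which no two classifiers agree gets no assignment).

```python
from collections import Counter
from statistics import mean

HIGH_CONFIDENCE_THRESHOLD = 0.80


def classify_assignment(predictions):
    """Apply the confidence rule to one contig.

    predictions maps a method name (rf, lda, knn) to a (bin, probability)
    pair. Returns (bin, mean_prob, confidence), or None when no two
    classifiers agree (assumed behavior, not stated in the documentation).
    """
    votes = Counter(bin_name for bin_name, _ in predictions.values())
    top_bin, n_votes = votes.most_common(1)[0]
    if n_votes < 2:
        return None

    # Assumption: average only the probabilities of the agreeing classifiers.
    mean_prob = mean(p for bin_name, p in predictions.values() if bin_name == top_bin)

    if n_votes == 3:
        if mean_prob > HIGH_CONFIDENCE_THRESHOLD:
            confidence = "high_confidence"
        else:
            confidence = "medium_confidence"
    else:
        confidence = "low_confidence"
    return top_bin, mean_prob, confidence


high = classify_assignment(
    {"rf": ("bin1", 0.95), "lda": ("bin1", 0.90), "knn": ("bin1", 0.85)})
medium = classify_assignment(
    {"rf": ("bin1", 0.70), "lda": ("bin1", 0.60), "knn": ("bin1", 0.80)})
low = classify_assignment(
    {"rf": ("bin1", 0.90), "lda": ("bin1", 0.90), "knn": ("bin2", 0.90)})
none = classify_assignment(
    {"rf": ("bin1", 0.90), "lda": ("bin2", 0.90), "knn": ("bin3", 0.90)})
print(high, medium, low, none, sep="\n")
```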
| Column | Description |
|---|---|
| contig | Identifier of the previously unbinned contig. |
| bin | The original bin assignment. “unbinned” indicates no prior assignment. |
| assigned_bin | The bin predicted for this contig by the classifiers. |
| method | The classification method (knn, lda, rf). knn = k-Nearest Neighbors, lda = Linear Discriminant Analysis, rf = Random Forest. |
| prob | The probability assigned by the method for placing the contig in assigned_bin. |
| mean_prob | The average probability from all methods for the contig-bin pair. |
| confidence | Qualitative confidence (“high_confidence”, “medium_confidence”, “low_confidence”). High confidence predictions are more reliable. |
MTase-Linker
Overview:
The MTase-linker module attempts to link discovered motifs to predicted MTase genes. MTases are part of Restriction-Modification (RM) systems and recognize specific DNA motifs. By integrating motif discovery results with gene prediction and homology searches, MTase-linker outputs files listing MTase genes, their predicted motif specificity, and whether the discovered motifs can be confidently linked to them.
Files:
mtase_assignment_table.tsv: Lists all predicted MTase genes, their likely modification type and motif specificity, and whether a motif is linked.
nanomotif_assignment_table.tsv: Similar to bin-motifs.tsv, with added columns indicating whether a motif is linked to a specific MTase.
mtase_assignment_table.tsv
| Column | Description |
|---|---|
| bin | The bin/genome in which the MTase gene is located. |
| gene_id | A unique identifier for each MTase gene, combining the contig ID and a gene number (e.g., “contig_12_5”). |
| contig | The contig on which the MTase gene is found. |
| mod_type_pred | The predicted modification type (e.g., ac for 6mA/4mC, m for 5mC). |
| sub_type | Predicted RM system subtype (I, II, IIG, III). |
| RM_system | TRUE/FALSE indicating if the MTase is part of a complete RM system. |
| motif_type_pred | Predicted motif type that the MTase may recognize (palindromic, non-palindromic, bipartite). |
| REbase_ID | Closest homolog MTase from REbase, if identity and coverage meet the threshold (≥80%). |
| motif_pred | Predicted recognition motif based on REbase annotation. |
| linked | TRUE/FALSE indicating if a motif could be unambiguously linked to the MTase gene. If TRUE, see detected_motif. |
| detected_motif | The exact motif sequence linked to this MTase, if linked is TRUE. |
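A typical use of this table is to pull out only the MTase genes with an unambiguous motif link. The sketch below shows one way to do that with the standard library. The rows are invented and include only a subset of the columns, and how an unlinked gene's detected_motif cell is written (empty here) is an assumption.

```python
import csv
import io

# Invented example rows with a subset of the columns described above.
# An empty detected_motif for unlinked genes is an assumption of this sketch.
EXAMPLE_TSV = (
    "bin\tgene_id\tcontig\tmod_type_pred\tlinked\tdetected_motif\n"
    "bin1\tcontig_12_5\tcontig_12\tac\tTRUE\tGATC\n"
    "bin1\tcontig_3_41\tcontig_3\tm\tFALSE\t\n"
)


def linked_mtases(handle):
    """Map gene_id to detected_motif for rows where linked is TRUE."""
    links = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # The linked column holds the strings TRUE/FALSE, not Python booleans.
        if row["linked"].strip().upper() == "TRUE":
            links[row["gene_id"]] = row["detected_motif"]
    return links


links = linked_mtases(io.StringIO(EXAMPLE_TSV))
print(links)
```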
nanomotif_assignment_table.tsv
This file includes data from bin-motifs.tsv plus two additional columns: linked and candidate_genes.
linked (TRUE/FALSE): Indicates if the motif can be unambiguously attributed to an MTase.
candidate_genes: Lists the MTase genes linked to the motif if linked = TRUE, or other potential candidates if not.
Additional Outputs From Dependencies
MTase-linker uses external tools for gene prediction and annotation:
DefenseFinder output: .../defensefinder/{assembly_name}_processed_defense_finder_systems.tsv lists RM-system gene annotations, used to verify whether MTases form part of a complete RM system.
MTase amino acid sequences: .../defensefinder/{assembly_name}_processed_defense_finder_mtase.faa contains predicted MTase amino acid sequences.
All predicted protein sequences: .../prodigal/{assembly_name}_processed.faa contains amino acid sequences for all predicted proteins, including MTases.