# Output description

In general, nanomotif outputs are provided as tab-separated values (TSV) files. Each TSV file contains one or more columns detailing the results, predictions, and underlying data.

---

## Motif Discovery

**Overview:**
The motif discovery step identifies DNA methylation motifs within bins. It outputs `bin-motifs.tsv`, which contains information about the discovered motifs, their modification types, and their degree of methylation.

**Files:**
- **bin-motifs.tsv**: Contains identified motifs.

**Columns in bin-motifs.tsv:**

| Column | Description |
|-----------------------------|-----------------------------------------------------------------------------------------------|
| **reference** | The identifier of the contig or bin in which the motif was discovered. |
| **mod_type** | The type of base modification associated with the motif. Uses single-letter or numeric modification codes (e.g., `a` for 6mA, `m` for 5mC, `21839` for 4mC). |
| **motif** | The detected DNA motif sequence. Uses [IUPAC degenerate nucleotide codes](https://www.bioinformatics.org/sms/iupac.html) to represent positions where multiple nucleotides are possible. |
| **mod_position** | The zero-based index of the modified base within the motif. For example, if `mod_position` is 0, the modification occurs on the first nucleotide of the motif. |
| **n_mod** | The count of motif occurrences considered "methylated" (fraction of mapped bases methylated, ≥0.7 by default). |
| **n_nomod** | The count of motif occurrences considered "unmethylated" (fraction of mapped bases methylated, <0.2 by default). |
| **motif_type** | Classification of the motif as palindromic, non-palindromic, or bipartite. Palindromic motifs are identical to their own reverse complement, while bipartite motifs have a gap separating two distinct parts. |
| **motif_complement** | The reverse complement of the motif. Only reported if the reverse complement is identified. |
| **mod_position_complement** | The zero-based position of the modified base in the motif complement. |
| **n_mod_complement** | The number of methylated occurrences of the complement motif. |
| **n_nomod_complement** | The number of unmethylated occurrences of the complement motif. |

---

## Bin Improvement

**Overview:**
The bin improvement module refines genome bins by identifying contigs that may be contaminants. Suspect contigs are recorded in `bin_contamination.tsv`. If requested, cleaned bins (with contaminants removed) can be created. Additionally, the `include_contigs` command assigns unbinned contigs to existing bins using classification models trained on methylation patterns, resulting in the `include_contigs.tsv` file.

**Files:**
- **bin_contamination.tsv**: Lists contigs flagged as contamination.
- **include_contigs.tsv**: Lists previously unbinned contigs and their new bin assignments.

### bin_contamination.tsv

A contig appears in this file if all four clustering methods (agg, spectral, gmm, hdbscan) assign it to a cluster different from the bin’s main cluster, suggesting contamination. Each contaminant contig has four rows, one per clustering algorithm, along with the cluster statistics. If the `--write_bins` flag is specified, new decontaminated bins are written to a bins folder.

| Column | Description |
|----------------------|------------------------------------------------------------------------------------------------------|
| **contig** | The identifier of the contig flagged as contamination. |
| **bin** | The name of the bin that the contig currently belongs to. |
| **method** | The clustering method used (agg, spectral, gmm, hdbscan). |
| **cluster** | The cluster ID assigned by the specified method. Methods produce different numeric cluster IDs. |
| **bin_cluster** | The cluster ID associated with the bin’s original assignment. |
| **bin_length** | The total length (in base pairs) of all contigs currently in the bin. |
| **n_contigs_bin** | The total number of contigs currently assigned to the bin. |
| **fraction_contigs** | The fraction of the bin’s contigs grouped under `bin_cluster` (e.g., 10/12 ≈ 0.83 if 10 out of 12 contigs are in `bin_cluster`). |
| **fraction_length** | The fraction of the bin’s length represented by contigs in `bin_cluster`. Similar to `fraction_contigs` but based on sequence length. |

### include_contigs.tsv

Unbinned contigs are assigned to bins using three classification methods (Random Forest, Linear Discriminant Analysis, and k-Nearest Neighbors) trained on methylation data. Each assignment receives one of three confidence levels:

- `high_confidence`: all three classifiers assign the contig to the same bin, with a joint mean probability above 0.80. The contig is assigned to that bin.
- `medium_confidence`: all three classifiers agree, but the joint mean probability is below 0.80.
- `low_confidence`: only two classifiers agree.

`high_confidence` assignments are also written to a `new_contig_bin.tsv` file.

| Column | Description |
|------------------|--------------------------------------------------------------------------------------------------------|
| **contig** | Identifier of the previously unbinned contig. |
| **bin** | The original bin assignment. "unbinned" indicates no prior assignment. |
| **assigned_bin** | The bin predicted for this contig by the classifiers. |
| **method** | The classification method (knn, lda, rf). knn = k-Nearest Neighbors, lda = Linear Discriminant Analysis, rf = Random Forest. |
| **prob** | The probability assigned by the method for placing the contig in `assigned_bin`. |
| **mean_prob** | The average probability from all methods for the contig-bin pair. |
| **confidence** | Qualitative confidence ("high_confidence", "medium_confidence", "low_confidence"). High confidence predictions are more reliable. |

---

## MTase-Linker

**Overview:**
The MTase-linker module attempts to link discovered motifs to predicted MTase genes. MTases are part of Restriction-Modification (RM) systems and recognize specific DNA motifs. By integrating motif discovery results with gene prediction and homology searches, MTase-linker outputs files listing MTase genes, their predicted motif specificity, and whether the discovered motifs can be confidently linked to them.

**Files:**
- **mtase_assignment_table.tsv**: Lists all predicted MTase genes, their likely modification type and motif specificity, and whether a motif is linked.
- **nanomotif_assignment_table.tsv**: Similar to `bin-motifs.tsv`, with added columns indicating whether a motif is linked to a specific MTase.

### mtase_assignment_table.tsv

| Column | Description |
|---------------------|-----------------------------------------------------------------------------------------------------|
| **bin** | The bin/genome in which the MTase gene is located. |
| **gene_id** | A unique identifier for each MTase gene, combining the contig ID and a gene number (e.g., "contig_12_5"). |
| **contig** | The contig on which the MTase gene is found. |
| **mod_type_pred** | The predicted modification type (e.g., `ac` for 6mA/4mC, `m` for 5mC). |
| **sub_type** | Predicted RM system subtype (I, II, IIG, III). |
| **RM_system** | TRUE/FALSE indicating if the MTase is part of a complete RM system. |
| **motif_type_pred** | Predicted motif type that the MTase may recognize (palindromic, non-palindromic, bipartite). |
| **REbase_ID** | Closest homolog MTase from REbase, if identity and coverage meet the threshold (≥80%). |
| **motif_pred** | Predicted recognition motif based on REbase annotation. |
| **linked** | TRUE/FALSE indicating if a motif could be unambiguously linked to the MTase gene. If TRUE, see `detected_motif`. |
| **detected_motif** | The exact motif sequence linked to this MTase, if `linked` is TRUE. |

### nanomotif_assignment_table.tsv

This file includes the data from `bin-motifs.tsv` plus two additional columns: `linked` and `candidate_genes`.

- **linked (TRUE/FALSE):** Indicates if the motif can be unambiguously attributed to an MTase.
- **candidate_genes:** Lists the MTase genes linked to the motif if `linked` = TRUE, or other potential candidates if not.

---

## Additional Outputs From Dependencies

MTase-linker uses external tools for gene prediction and annotation:

- **DefenseFinder output:** `.../defensefinder/{assembly_name}_processed_defense_finder_systems.tsv` lists RM system gene annotations, used to verify whether MTases form part of a complete RM system.
- **MTase amino acid sequences:** `.../defensefinder/{assembly_name}_processed_defense_finder_mtase.faa` contains predicted MTase amino acid sequences.
- **All predicted protein sequences:** `.../prodigal/{assembly_name}_processed.faa` contains amino acid sequences for all predicted proteins, including MTases.

---
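
The confidence levels described for `include_contigs.tsv` can be illustrated with a small Python sketch. This is not nanomotif's implementation: the function name `classify_confidence` and its input shape are hypothetical, and the handling of a mean probability of exactly 0.80 (treated here as medium) and of contigs where no two classifiers agree (returned as unassigned) are assumptions, since the text above does not specify them.

```python
from collections import Counter
from statistics import mean


def classify_confidence(predictions, threshold=0.80):
    """Return (assigned_bin, confidence) for one unbinned contig.

    `predictions` is a hypothetical mapping of method name ("knn", "lda",
    "rf") to a tuple (assigned_bin, prob), mirroring the `method`,
    `assigned_bin`, and `prob` columns of include_contigs.tsv.
    """
    if len(predictions) != 3:
        raise ValueError("expected one prediction per classifier (knn, lda, rf)")

    # Count how many classifiers voted for each bin.
    votes = Counter(bin_name for bin_name, _ in predictions.values())
    top_bin, n_votes = votes.most_common(1)[0]

    if n_votes == 3:
        # All classifiers agree: the mean probability decides high vs. medium.
        mean_prob = mean(prob for _, prob in predictions.values())
        # Assumption: a mean of exactly `threshold` counts as medium.
        if mean_prob > threshold:
            return top_bin, "high_confidence"
        return top_bin, "medium_confidence"

    if n_votes == 2:
        # Only two classifiers agree.
        return top_bin, "low_confidence"

    # Assumption: with no agreement, the contig is left unassigned.
    return None, None


example = {"knn": ("bin1", 0.90), "lda": ("bin1", 0.85), "rf": ("bin1", 0.95)}
print(classify_confidence(example))
```

In this sketch only a unanimous vote with a mean probability above the threshold yields `high_confidence`, which corresponds to the assignments that end up in `new_contig_bin.tsv`.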