Usage
Motif discovery
The motif discovery process identifies motifs at the bin level. The command requires the relation between bins and contigs to be specified. This can be done in one of three ways:
A TSV file specifying which bin contigs belong to (-c or --contig_bin).
A list of bin FASTA files with contig names as headers (-f or --files).
A directory containing bin FASTA files with contig names as headers (-d or --directory). The file extension of the bin FASTA files can be specified using the --extension flag.
The motif discovery process requires an assembly file and a modkit pileup file as input. The output will be written to the specified output folder, which will contain a bin-motifs.tsv file summarizing the identified motifs per bin. See here for detailed output description.
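The three ways of relating contigs to bins carry the same information: a mapping from contig name to bin. As a minimal sketch, the Python below converts a directory of bin FASTA files into a contig-bin TSV. The helper names and the two-column, header-less contig<TAB>bin layout are assumptions made for illustration, so check the layout nanomotif expects before passing the result to -c.

```python
from pathlib import Path


def contig_bin_pairs(bin_dir, extension=".fasta"):
    """Yield (contig, bin) pairs from the FASTA files in bin_dir.

    The bin name is the file name without its extension, and the contig
    name is the first whitespace-delimited token of each header line.
    """
    for fasta in sorted(Path(bin_dir).glob(f"*{extension}")):
        bin_name = fasta.name[: -len(extension)]
        with fasta.open() as handle:
            for line in handle:
                if line.startswith(">"):
                    header = line[1:].split()
                    if header:
                        yield header[0], bin_name


def write_contig_bin(bin_dir, out_tsv, extension=".fasta"):
    """Write an assumed two-column, header-less contig<TAB>bin TSV."""
    pairs = list(contig_bin_pairs(bin_dir, extension))
    with open(out_tsv, "w") as out:
        for contig, bin_name in pairs:
            out.write(f"{contig}\t{bin_name}\n")
    return pairs
```

Calling write_contig_bin("path/to/bins", "contig_bin.tsv") would write one line per contig found in the bin files.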
The primary parameter to tune is --min_motif_score, which sets the minimum score a motif must reach to be kept after identification. A lower value results in more motifs being identified, but may also include more false positives. The default value of 1 is a good starting point for most datasets.
Minimum score of 0.5: Very sensitive, will identify many motifs, but also many false positives. In addition to small palindromic and non-palindromic motifs, most very degenerate or long bipartite motifs may be identified. Besides motifs for systematic errors around other methylation types (e.g., C5mCGG for 6mA, giving rise to a 6mA motif variant containing CCGG, such as 6mACCGG), motifs for systematic errors around non-methylated positions may be included, particularly 5mC motifs in high GC% organisms.
Minimum score of 1.0: Balanced sensitivity and specificity, suitable for most datasets. Will identify all small palindromic and non-palindromic motifs. Some very degenerate or long bipartite motifs may be missed. Motifs for systematic errors around other methylation types may be included (e.g., C5mCWGG for 6mA, giving rise to a 6mA motif variant containing CCWGG, such as 6mANCCTGG).
Minimum score of 1.5: Very specific, will identify fewer motifs, but with high confidence. May miss some true motifs, particularly very degenerate or long bipartite motifs. Motifs for systematic errors around other methylation types or in high GC% organisms are unlikely to be included.
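To compare the three score levels on one dataset, it can help to run motif discovery once per threshold. The sketch below only assembles the command lines, using flags documented in this section. The helper names and the per-score output folder naming are illustrative assumptions, not part of nanomotif.

```python
def motif_discovery_command(assembly, pileup, bins_dir, out_dir, min_motif_score):
    """Build the argument list for one motif_discovery run."""
    return [
        "nanomotif", "motif_discovery",
        assembly, pileup,
        "-d", bins_dir,
        "--out", out_dir,
        "--min_motif_score", str(min_motif_score),
    ]


def score_sweep(assembly, pileup, bins_dir, out_root, scores=(0.5, 1.0, 1.5)):
    """Return one command per score, each with its own output folder."""
    return [
        motif_discovery_command(
            assembly, pileup, bins_dir, f"{out_root}/score_{score}", score
        )
        for score in scores
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True), so that the resulting bin-motifs.tsv files can be compared side by side.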
QUICK START
ASSEMBLY="path/to/assembly.fasta"
PILEUP="path/to/pileup.tsv"
BINS="path/to/bins"
OUT="path/to/output"
nanomotif motif_discovery $ASSEMBLY $PILEUP -d $BINS --out $OUT
usage: nanomotif motif_discovery (-c CONTIG_BIN | -f FILES [FILES ...] | -d DIRECTORY) [--extension EXTENSION] [--out OUT] [--methylation_threshold_low METHYLATION_THRESHOLD_LOW]
[--methylation_threshold_high METHYLATION_THRESHOLD_HIGH] [--search_frame_size SEARCH_FRAME_SIZE] [--minimum_kl_divergence MINIMUM_KL_DIVERGENCE]
[--min_motif_score MIN_MOTIF_SCORE] [--threshold_valid_coverage THRESHOLD_VALID_COVERAGE] [--min_motifs_bin MIN_MOTIFS_BIN] [-t THREADS] [-v]
[--seed SEED] [-h]
assembly pileup
positional arguments:
assembly path to the assembly file.
pileup path to the modkit pileup file.
contig bin arguments, use one of::
-c CONTIG_BIN, --contig_bin CONTIG_BIN
TSV file specifying which bin contigs belong.
-f FILES [FILES ...], --files FILES [FILES ...]
List of bin FASTA files with contig names as headers.
-d DIRECTORY, --directory DIRECTORY
Directory containing bin FASTA files with contig names as headers.
--extension EXTENSION
File extension of the bin FASTA files if using -d (DIRECTORY) argument. Default is '.fasta'.
Options:
--out OUT path to the output folder
--methylation_threshold_low METHYLATION_THRESHOLD_LOW
A position is considered non-methylated if fraction of methylation is below this threshold. Default: 0.3
--methylation_threshold_high METHYLATION_THRESHOLD_HIGH
A position is considered methylated if fraction of methylated reads is above this threshold. Default: 0.7
--search_frame_size SEARCH_FRAME_SIZE
length of the sequnces sampled around confident methylation sites. Default: 40
--minimum_kl_divergence MINIMUM_KL_DIVERGENCE
Minimum KL-divergence for a position to considered for expansion in motif search. Higher value means less exhaustive, but faster search. Default: 0.05
--min_motif_score MIN_MOTIF_SCORE
Minimum score for a motif to be kept after identification. Default: 1
--threshold_valid_coverage THRESHOLD_VALID_COVERAGE
Minimum valid base coverage (Nvalid_cov) for a position to be considered. Default: 5
--min_motifs_bin MIN_MOTIFS_BIN
Minimum number of motif observations in a bin. Default: 20
general arguments:
-t THREADS, --threads THREADS
Number of threads to use. Default is 1
-v, --verbose Increase output verbosity. (set logger to debug level)
--seed SEED Seed for random number generator. Default: 1
-h, --help show this help message and exit
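The two methylation thresholds and the coverage threshold in the help text above together decide how a pileup position is treated. The following is a minimal sketch of that logic with the documented defaults. The function name, the "low_coverage" and "ambiguous" labels, and the handling of positions that fall between the two thresholds are assumptions, since this section does not state how nanomotif treats them.

```python
def classify_position(n_mod, n_valid_cov, low=0.3, high=0.7, min_valid_cov=5):
    """Label one pileup position from its modified and valid read counts."""
    # Positions with too few valid reads are not considered at all.
    if n_valid_cov < min_valid_cov or n_valid_cov == 0:
        return "low_coverage"
    fraction = n_mod / n_valid_cov
    if fraction > high:
        return "methylated"
    if fraction < low:
        return "non_methylated"
    # Assumed label for fractions between the two thresholds.
    return "ambiguous"
```

For example, a position with 9 modified reads out of 10 valid reads is labelled methylated, while 5 out of 10 falls between the thresholds.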
Bin improvement
Bin contamination
After motif identification, it is possible to identify contamination in bins using the bin-motifs.tsv, assembly, and pileup. This will generate a bin_contamination.tsv specifying the contigs that are flagged as contamination.
The detect_contamination command detects putative contamination using four clustering methods (agg, spectral, gmm, hdbscan), all of which have to flag the contig as a contaminant.
See here for detailed output description.
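The consensus rule described above can be sketched as a simple vote count. This is an illustration of the idea only, not nanomotif's implementation: the function name and the input structure (a mapping from method name to the set of contigs that method flagged) are assumptions.

```python
from collections import Counter


def consensus_contaminants(flags_by_method, num_consensus=None):
    """Return contigs flagged by at least num_consensus methods.

    flags_by_method maps a method name to the set of contigs it flagged.
    With num_consensus=None, every method must agree.
    """
    if num_consensus is None:
        num_consensus = len(flags_by_method)
    votes = Counter(
        contig for flagged in flags_by_method.values() for contig in set(flagged)
    )
    return sorted(contig for contig, n in votes.items() if n >= num_consensus)
```

Lowering the required number of agreeing methods, which is what the --num_consensus option controls, flags more contigs at the cost of more false positives.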
QUICK START
ASSEMBLY="path/to/assembly.fasta"
PILEUP="path/to/pileup.bed"
BIN_MOTIFS="path/to/nanomotif/bin-motifs.tsv"
CONTIG_BIN="path/to/contig_bin.tsv"
OUT="path/to/output"
nanomotif detect_contamination --pileup $PILEUP --assembly $ASSEMBLY --bin_motifs $BIN_MOTIFS --contig_bins $CONTIG_BIN --out $OUT
usage: nanomotif detect_contamination [-h] --pileup PILEUP --assembly ASSEMBLY --bin_motifs BIN_MOTIFS --contig_bins CONTIG_BINS [-t THREADS]
[--min_valid_read_coverage MIN_VALID_READ_COVERAGE] [--methylation_threshold METHYLATION_THRESHOLD] [--num_consensus NUM_CONSENSUS]
[--force] [--write_bins] --out OUT [--contamination_file CONTAMINATION_FILE]
optional arguments:
-h, --help show this help message and exit
-t THREADS, --threads THREADS
Number of threads to use for multiprocessing
--min_valid_read_coverage MIN_VALID_READ_COVERAGE
Minimum read coverage for calculating methylation [used with methylation_util executable]
--methylation_threshold METHYLATION_THRESHOLD
Filtering criteria for trusting contig methylation. It is the product of mean_read_coverage and N_motif_observation. Higher value means stricter
criteria. [default: 24]
--num_consensus NUM_CONSENSUS
Number of models that has to agree for classifying as contaminant
--force Force override of motifs-scored-read-methylation.tsv. If not set existing file will be used.
--write_bins If specified, new bins will be written to a bins folder. Requires --assembly_file to be specified.
--contamination_file CONTAMINATION_FILE
Path to an existing contamination file if bins should be outputtet as a post-processing step
Mandatory Arguments:
--pileup PILEUP Path to pileup.bed
--assembly ASSEMBLY Path to assembly file [fasta format required]
--bin_motifs BIN_MOTIFS
Path to bin-motifs.tsv file
--contig_bins CONTIG_BINS
Path to bins.tsv file for contig bins
--out OUT Path to output directory
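The --methylation_threshold option is described above as a product of mean read coverage and the number of motif observations. A minimal sketch of such a filter follows. The function name and the inclusive comparison are assumptions, as the help text does not say whether a contig exactly at the threshold passes.

```python
def passes_methylation_threshold(mean_read_coverage, n_motif_observation, threshold=24):
    """Return True when coverage times motif observations reaches the threshold."""
    # Assumed inclusive comparison; the help text only defines the product.
    return mean_read_coverage * n_motif_observation >= threshold
```

Under this reading, a motif seen 3 times at a mean coverage of 8 gives 24 and passes the default, while 5 observations at a mean coverage of 4 gives 20 and does not.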
Include unbinned contigs
After motif identification, it is possible to assign unbinned contigs to bins using the bin-motifs.tsv, assembly, and pileup.
The include_contigs command assigns unbinned contigs in the assembly file to bins by training three supervised classifiers: random forest, linear discriminant analysis, and k-neighbors classifier. This will generate include_contigs.tsv specifying the contigs that were assigned a new bin. See here for detailed output description.
If decontamination should not be performed, include_contigs can be run with neither the --run_detect_contamination flag nor the --contamination_file flag.
Note: Assigning contigs based purely on methylation patterns can lead to errors as MAGs can share methylation patterns, which is especially problematic for unrecovered MAGs.
QUICK START
ASSEMBLY="path/to/assembly.fasta"
PILEUP="path/to/pileup.bed"
BIN_MOTIFS="path/to/nanomotif/bin-motifs.tsv"
CONTIG_BIN="path/to/contig_bin.tsv"
OUT="path/to/output"
nanomotif include_contigs --pileup $PILEUP --assembly $ASSEMBLY --bin_motifs $BIN_MOTIFS --contig_bins $CONTIG_BIN --run_detect_contamination --out $OUT
usage: nanomotif include_contigs [-h] --pileup PILEUP --assembly ASSEMBLY
--bin_motifs BIN_MOTIFS --contig_bins
CONTIG_BINS [-t THREADS]
[--min_valid_read_coverage MIN_VALID_READ_COVERAGE]
[--methylation_threshold METHYLATION_THRESHOLD]
[--num_consensus NUM_CONSENSUS] [--force]
[--write_bins] --out OUT
[--mean_model_confidence MEAN_MODEL_CONFIDENCE]
[--contamination_file CONTAMINATION_FILE | --run_detect_contamination]
options:
-h, --help show this help message and exit
-t THREADS, --threads THREADS
Number of threads to use for multiprocessing
--min_valid_read_coverage MIN_VALID_READ_COVERAGE
Minimum read coverage for calculating methylation
[used with methylation_util executable]
--methylation_threshold METHYLATION_THRESHOLD
Filtering criteria for trusting contig methylation. It
is the product of mean_read_coverage and
N_motif_observation. Higher value means stricter
criteria. [default: 24]
--num_consensus NUM_CONSENSUS
Number of models that has to agree for classifying as
contaminant
--force Force override of motifs-scored-read-methylation.tsv.
If not set existing file will be used.
--write_bins If specified, new bins will be written to a bins
folder. Requires --assembly_file to be specified.
--mean_model_confidence MEAN_MODEL_CONFIDENCE
Mean probability between models for including contig.
Contigs above this value will be included. [default:
0.8]
--contamination_file CONTAMINATION_FILE
Path to an existing contamination file to include in
the analysis
--run_detect_contamination
Indicate that the detect_contamination workflow should
be run first
Mandatory Arguments:
--pileup PILEUP Path to pileup.bed
--assembly ASSEMBLY Path to assembly file [fasta format required]
--bin_motifs BIN_MOTIFS
Path to bin-motifs.tsv file
--contig_bins CONTIG_BINS
Path to bins.tsv file for contig bins
--out OUT Path to output directory
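The --mean_model_confidence option is described above as the mean probability between models. One way to read that is sketched below: average each bin's probability across the classifiers and keep the best bin only if its mean exceeds the cutoff. The function name, the input structure, and the exact way the probabilities are combined are assumptions. This is not nanomotif's code.

```python
from statistics import mean


def assign_contig(probabilities_by_model, mean_model_confidence=0.8):
    """Pick a bin for one contig from per-model class probabilities.

    probabilities_by_model maps a model name to a {bin: probability} dict.
    Returns (bin, mean_probability), or (None, mean_probability) when the
    best bin does not exceed the confidence cutoff.
    """
    bins = set().union(*(p.keys() for p in probabilities_by_model.values()))
    averaged = {
        b: mean(p.get(b, 0.0) for p in probabilities_by_model.values())
        for b in bins
    }
    # Sorting first makes ties resolve to the alphabetically first bin.
    best_bin = max(sorted(averaged), key=averaged.get)
    best_prob = averaged[best_bin]
    return (best_bin if best_prob > mean_model_confidence else None), best_prob
```

A contig on which the three classifiers disagree ends up with a low mean probability and stays unbinned, which is the conservative behaviour the note on shared methylation patterns calls for.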
MTase-linker
This module links methylation motifs to their corresponding MTase and, when present, their entire RM system.
The MTase-linker module has additional dependencies that are not automatically installed with Nanomotif. Therefore, before using this module, you must manually install these dependencies using the MTase-linker install command.
The MTase-linker module requires that conda is available on your system.
nanomotif MTase-linker install
This will create a folder named ML_dependencies in your current working directory, containing the required dependencies for the MTase-linker module. You can use the --dependency_dir flag to change the installation location of the ML_dependencies folder.
The installation requires conda to generate required environments.
Once the additional dependencies are installed, you can run the workflow using MTase-linker run. See here for detailed output description.
QUICK START
ASSEMBLY="path/to/assembly.fasta"
CONTIG_BIN="path/to/contig_bin.tsv"
BIN_MOTIFS="path/to/nanomotif/bin-motifs.tsv"
ML_DEPENDENCIES="path/to/ML_dependencies"
OUT="path/to/output"
nanomotif MTase-linker run --assembly $ASSEMBLY --contig_bin $CONTIG_BIN --bin_motifs $BIN_MOTIFS -d $ML_DEPENDENCIES --out $OUT
usage: nanomotif MTase-linker run [-h] [-t THREADS] [--forceall FORCEALL] [--dryrun DRYRUN] --assembly ASSEMBLY --contig_bin CONTIG_BIN --bin_motifs BIN_MOTIFS [-d DEPENDENCY_DIR] [-o OUT] [--identity IDENTITY] [--qcovs QCOVS] [--minimum_motif_methylation MINIMUM_MOTIF_METHYLATION]
options:
-h, --help show this help message and exit
-t THREADS, --threads THREADS
Number of threads to use. Default is 1
--forceall FORCEALL Flag for snakemake. Forcerun workflow regardless of already created output (default = False)
--dryrun DRYRUN Flag for snakemake. Dry-run the workflow. Default is False
--assembly ASSEMBLY Path to assembly file.
--contig_bin CONTIG_BIN
tsv file specifying which bin contigs belong.
--bin_motifs BIN_MOTIFS
bin-motifs.tsv output from nanomotif.
-d DEPENDENCY_DIR, --dependency_dir DEPENDENCY_DIR
Path to the ML_dependencies directory created during installation of the MTase-linker module. Default is cwd/ML_dependencies
-o OUT, --out OUT Path to output directory. Default is cwd
--identity IDENTITY Identity threshold for motif prediction. Default is 80
--qcovs QCOVS Query coverage for motif prediction. Default is 80
--minimum_motif_methylation MINIMUM_MOTIF_METHYLATION
Minimum fraction of motif occurrences in the bin that must be methylated for the motif to be considered. Default is 0.5
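The --minimum_motif_methylation option above acts as a per-motif cutoff on the methylated fraction within a bin. As a minimal sketch with the documented default of 0.5, the check could look like the following. The function and argument names are hypothetical, and the inclusive comparison is an assumption.

```python
def motif_is_considered(n_methylated, n_occurrences, minimum_motif_methylation=0.5):
    """Return True when enough occurrences of a motif in a bin are methylated."""
    if n_occurrences == 0:
        # A motif with no occurrences in the bin cannot be evaluated.
        return False
    return n_methylated / n_occurrences >= minimum_motif_methylation
```

Raising the cutoff restricts linking to motifs that are methylated at most of their occurrences in the bin.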