Required File Preparation

Before running nanomotif for motif detection and analysis, ensure that you have prepared the necessary input files. These include a genome assembly file, a methylation pileup file, and a contig-bin relationship file.

Assembly

The assembly file should contain all contigs in FASTA format. Each header should have a unique contig identifier. The sequence should only include standard nucleotide or IUPAC characters (either upper or lower case). Nanomotif has been primarily developed and tested using assemblies generated by Flye.

Requirements:

Format: FASTA
Contains all contigs for evaluation
Contig ID in the FASTA header
IUPAC-compliant characters only
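
As a quick pre-flight check, the requirements above can be tested with standard tools. The sketch below is not part of nanomotif: check_assembly is a hypothetical helper name, and it assumes the allowed alphabet is A, C, G, T plus the IUPAC ambiguity codes, with no gap characters.

```shell
# check_assembly FASTA
# Reports duplicate contig IDs and sequence lines with non-IUPAC characters.
# Returns non-zero if any problem is found.
check_assembly() {
    awk '
        /^>/ {
            # Contig ID: header text up to the first whitespace, minus ">"
            id = substr($1, 2)
            if (seen[id]++) { printf "duplicate contig ID: %s\n", id; bad = 1 }
            next
        }
        # Assumed alphabet: ACGT plus IUPAC ambiguity codes, either case
        /[^ACGTRYSWKMBDHVNacgtryswkmbdhvn]/ {
            printf "non-IUPAC character on line %d\n", NR; bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Calling check_assembly "$ASSEMBLY" prints one line per problem and returns a non-zero status if the file needs fixing before it is passed to nanomotif.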

Methylation Pileup

The methylation pileup file indicates how many mapped reads at each position show evidence of methylation. Nanomotif accepts both raw pileup files and bgzipped pileup files (with a .gz extension). If using a bgzipped file, ensure that it is indexed with tabix, or use epimetheus bgzip compress to produce it. Using a bgzipped and indexed file significantly speeds up processing.
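
By default, tabix writes its index next to the compressed file with a .tbi suffix. The sketch below checks for that pairing before a run. pileup_is_indexed is a hypothetical helper, not a nanomotif command, and it only looks at file names and sizes: it does not verify that the .gz file is really BGZF-compressed, and it assumes the default .tbi index rather than a .csi index.

```shell
# pileup_is_indexed PILEUP_GZ
# Succeeds if PILEUP_GZ ends in .gz, is non-empty, and a non-empty
# tabix index (PILEUP_GZ.tbi) sits next to it.
pileup_is_indexed() {
    case "$1" in
        *.gz) [ -s "$1" ] && [ -s "$1.tbi" ] ;;
        *)    return 1 ;;
    esac
}
```

For example, pileup_is_indexed "${PILEUP}.gz" || echo "index missing" flags a compressed pileup that still needs indexing.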

To generate this file:

Map reads (with methylation calls) to the assembly.
Use modkit pileup to create the pileup.

Example commands:

MODCALLS="path/to/reads/with/methylation/calls.bam"
ASSEMBLY="path/to/assembly.fa"
MAPPING="path/to/generated/mapping.bam"
PILEUP="path/to/generated/pileup.bed"

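# Extract reads with their MM/ML methylation tags, map them to the assembly
# (-y carries the tags into the alignments), then sort
samtools fastq -T MM,ML "$MODCALLS" | \
    minimap2 -ax map-ont -y "$ASSEMBLY" - | \
    samtools view -bS | \
    samtools sort -o "$MAPPING"

# Index the sorted BAM; modkit pileup expects an indexed BAM
samtools index "$MAPPING"

modkit pileup --only-tabs "$MAPPING" "$PILEUP"

# Add --keep to retain the uncompressed pileup file
epimetheus bgzip compress -i "$PILEUP"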

Expected format: The pileup file is a tab-delimited table where each row represents a position on a contig, including information about methylation status.

Running “head” on the pileup file should produce a table similar to the one below:

contig_3	0	1	m	133	-	0	1	255,0,0	133	0.00	0	133	6	0
contig_3	1	2	a	174	+	1	2	255,0,0	174	1.72	3	171	3	0
contig_3	2	3	a	172	+	2	3	255,0,0	172	2.33	4	168	7	0
contig_3	3	4	a	178	+	3	4	255,0,0	178	0.56	1	177	2	0
contig_3	4	5	a	177	+	4	5	255,0,0	177	2.82	5	172	5	0
contig_3	5	6	a	179	+	5	6	255,0,0	179	2.79	5	174	3	2
contig_3	5	6	m	1	+	5	6	255,0,0	1	0.00	0	1	3	180
contig_3	5	6	a	1	-	5	6	255,0,0	1	0.00	0	1	0	156
contig_3	6	7	m	183	+	6	7	255,0,0	183	0.55	1	182	1	0
contig_3	6	7	a	4	-	6	7	255,0,0	4	0.00	0	4	0	153
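
A quick sanity check on a raw (uncompressed) pileup is to count rows per modification code, which sits in column 4 of the table above (a and m in this example). The sketch below is illustrative only: summarize_pileup is a hypothetical helper, and the minimum field count of 10 is an assumption that can be overridden with a second argument.

```shell
# summarize_pileup PILEUP [MIN_FIELDS]
# Prints "<mod_code><TAB><row count>" sorted by code. Returns non-zero if
# any row has fewer than MIN_FIELDS (default 10) tab-separated fields.
summarize_pileup() {
    awk -F'\t' -v min="${2:-10}" '
        NF < min {
            printf "line %d: only %d fields\n", NR, NF > "/dev/stderr"
            bad = 1
            next
        }
        { rows[$4]++ }
        END {
            for (code in rows) printf "%s\t%d\n", code, rows[code] | "sort"
            close("sort")
            exit (bad ? 1 : 0)
        }
    ' "$1"
}
```

A row that was split on spaces instead of tabs, a common result of copy-pasting, shows up as a line with too few fields.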

Considerations:

Use untrimmed reads for mapping to avoid downstream errors.
By default, modkit pileup estimates its filter threshold from the data, which may yield a low methylation threshold and introduce noise. Passing --filter-threshold 0.7 is recommended to reduce noise and improve motif detection quality.

Contig-Bin Relationship

For analyses that require binning, you need a contig-bin relationship. This maps each contig to its corresponding bin, which is essential for binning-based motif discovery.

This information can be passed in one of three ways:

Contig-Bin File: A file that explicitly maps contigs to bins
Bin FASTA Files: A directory of bin FASTA files, where each file corresponds to a bin
List of bin FASTAs: A list of bin FASTA files, where each file corresponds to a bin

Creating a Contig-Bin File

This file links each contig to its corresponding bin. It is a tab-separated file with two columns and no header:

Column 1: Contig ID
Column 2: Bin ID

If you have a folder of bin FASTA files (one file per bin), you can generate the contig-bin file by extracting contig IDs and their associated bin filenames, then formatting this information into a two-column TSV.

BINS="/path/to/bins/fasta"    # Bins directory
BIN_EXT="fa"                  # Bins file extension
OUT="contig_bin.tsv"          # contig-bin output destination

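# -H keeps the "file:" prefix even when only one bin file matches
grep -H "^>" "${BINS}"/*."${BIN_EXT}" | \
        sed "s/.*\///" | \
        sed "s/\.${BIN_EXT}:>/\t/" | \
        awk -F'\t' '{split($2, id, " "); print id[1] "\t" $1}' > "$OUT"
# The split drops any description after the contig ID in the FASTA header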

Example output:

contig_1	bin1
contig_2	bin1
contig_3	bin1
contig_4	bin2
contig_5	bin2
contig_6	bin3
contig_7	bin3
contig_8	bin3
contig_9	bin3
contig_10	bin1
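
Before using the contig-bin file, it can help to cross-check it against the assembly. The sketch below is a hypothetical helper (check_contig_bin is not a nanomotif command). It assumes that each contig should belong to exactly one bin and that every contig ID in the file should appear as a header in a non-empty assembly FASTA.

```shell
# check_contig_bin CONTIG_BIN_TSV ASSEMBLY_FASTA
# Reports malformed rows, contigs assigned to more than one bin, and
# contigs missing from the assembly. Returns non-zero on any problem.
check_contig_bin() {
    awk -F'\t' '
        # First file (assembly): collect contig IDs from FASTA headers
        NR == FNR {
            if (/^>/) { split(substr($0, 2), h, /[ \t]/); in_asm[h[1]] = 1 }
            next
        }
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected two non-empty columns\n", FNR
            bad = 1
            next
        }
        ($1 in bin) && bin[$1] != $2 {
            printf "contig %s assigned to %s and %s\n", $1, bin[$1], $2
            bad = 1
        }
        !($1 in in_asm) { printf "contig %s not in assembly\n", $1; bad = 1 }
        { bin[$1] = $2 }
        END { exit (bad ? 1 : 0) }
    ' "$2" "$1"
}
```

With the variables from the examples above, this would be called as check_contig_bin "$OUT" "$ASSEMBLY".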