MITNANEX's main purpose is to extract mitochondrial Nanopore reads de novo from whole-genome sequencing (WGS) data, with no need for seeds or reference sequences. It also returns a draft assembly of the mitogenome using Flye.
First, clone this repository and add it to your PATH:
git clone https://github.com/juanjo255/MITNANEX.git; cd MITNANEX; export PATH=$(pwd):$PATH
The best way to install MITNANEX's dependencies is through a beautiful conda/mamba environment. Before that, you must have Rust installed (https://www.rust-lang.org/tools/install).
For Mac M1 using mamba (you can change it for conda):
CONDA_SUBDIR=osx-64 mamba create -n mitnanex -c conda-forge -c bioconda seqkit seqtk fpa minimap2 miniasm flye gfastats samtools filtlong
mamba activate mitnanex
pip install pandas maturin biopython scikit-learn utils-mitnanex
If the pip package utils-mitnanex causes problems, uninstall it and build it from source instead:
pip uninstall utils-mitnanex
cd src/utils_rs; maturin develop
For Linux:
conda create -n mitnanex -c conda-forge -c bioconda seqkit seqtk fpa minimap2 miniasm flye gfastats samtools filtlong
conda activate mitnanex
pip install pandas maturin biopython scikit-learn utils-mitnanex
MITNANEX needs the following tools:
Notes: setup.sh will create a mamba environment with all the dependencies in the .yml file.
Quick start:
./mitnanex_cli.sh -i path/to/fastQ -p 15000 -m 1000 -t 8 -s 0.6 -g GenomeSize(g|m|k) -w path/to/output
Notes:
To print the help message:
./mitnanex_cli.sh -h
Options:
-i Input file. [required]
-t Threads. [4].
-p Proportion. For sampling. It can be a proportion or a number of reads (0.3|10000). [0.3].
-m Min-len. Filter reads by minimum length. Read seqkit seq documentation. [-1].
-M Max-len. Filter reads by maximum length. Read seqkit seq documentation. [-1].
-w Working directory. Path to create the folder which will contain all mitnanex information. [./mitnanex_results].
-r Prefix. Name prefix added to every output file. [input file name].
-c Coverage. Minimum coverage per cluster accepted. [-1].
-d Different output directory. Create a different output directory every run (it uses the date and time). [False]
-s Mapping identity. Minimum identity between two reads to be stored in the same cluster. [0.6]
-q Min mapping quality (>=). This is for samtools. [-1].
-f Flye mode. [--nano-hq]
-g GenomeSize. Your best estimate of the mitogenome size, used for read correction with Canu. [required]
-h Help.
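The value formats accepted by -g (a number with a g, m or k suffix) and -p (a proportion or a read count) can be illustrated with a small sketch. The function names below are hypothetical and are not taken from mitnanex_cli.sh. The rule that values below 1 are proportions is also an assumption made for illustration.

```python
def parse_genome_size(text):
    """Convert a size such as '30k', '1.5m' or '2g' into base pairs."""
    multipliers = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(float(text))


def parse_sampling(text):
    """Classify a -p style value (assumed rule: below 1 means proportion)."""
    value = float(text)
    if value <= 0:
        raise ValueError("sampling value must be positive")
    if value < 1:
        return ("proportion", value)
    return ("count", int(value))
```

For instance, `parse_genome_size("30k")` gives a size in base pairs, and `parse_sampling("10000")` is classified as a read count rather than a proportion.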
MITNANEX is a pipeline that depends on other open source tools (see dependencies).
Throughout this section, I show results from the assembly of the Talaromyces santanderensis mitogenome, obtained with MITNANEX from a Nanopore run performed at EAFIT University.
First, MITNANEX uses seqkit and seqtk to subsample the reads. It then runs minimap2 to find overlaps between reads. MITNANEX groups reads that share at least a certain level of identity (tweakable parameter). Each read counts toward the "coverage" of its group, and each cluster is represented only by its longest read.
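The grouping step can be sketched in Python as follows. This is a minimal illustration, not MITNANEX's actual implementation: it assumes the overlaps are in minimap2's PAF format, approximates identity as matching bases divided by alignment block length (columns 10 and 11), and merges reads with a simple union-find (single linkage). The function name `cluster_reads` is hypothetical.

```python
from collections import defaultdict


def cluster_reads(paf_lines, min_identity=0.6):
    """Group reads linked by overlaps with identity >= min_identity."""
    parent = {}   # union-find parent pointer per read
    length = {}   # read name -> read length

    def find(read):
        # Follow parent pointers to the root, compressing the path.
        while parent[read] != read:
            parent[read] = parent[parent[read]]
            read = parent[read]
        return read

    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # not a complete PAF record
        query, target = cols[0], cols[5]
        matches, block = int(cols[9]), int(cols[10])
        for name, size in ((query, int(cols[1])), (target, int(cols[6]))):
            parent.setdefault(name, name)
            length[name] = size
        if query != target and block > 0 and matches / block >= min_identity:
            parent[find(query)] = find(target)

    members = defaultdict(list)
    for read in parent:
        members[find(read)].append(read)

    clusters = [
        {
            "representative": max(reads, key=lambda r: length[r]),  # longest read
            "coverage": len(reads),  # each read counts once
            "reads": sorted(reads),
        }
        for reads in members.values()
    ]
    return sorted(clusters, key=lambda c: c["coverage"], reverse=True)
```

Single linkage is only one way to turn pairwise overlaps into groups; a stricter scheme would compare every read against the cluster representative instead.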
Once all reads are grouped, MITNANEX keeps only the groups with the highest coverage, at least 3 of them (tweakable parameter). Given the short length of the mitochondrial genome and its high coverage in WGS data, we expect most of it to be contained in these clusters.
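One possible reading of this selection rule is sketched below. It assumes each cluster is a dict with a "coverage" entry and that a negative minimum coverage (the default of -c) disables the threshold. Falling back to the top 3 when too few clusters pass is an interpretation of "at least 3", not confirmed behavior. The function name `select_clusters` is hypothetical.

```python
def select_clusters(clusters, min_keep=3, min_coverage=-1):
    """Keep the highest-coverage clusters, never fewer than min_keep if available."""
    ranked = sorted(clusters, key=lambda c: c["coverage"], reverse=True)
    if min_coverage < 0:
        # No coverage threshold: keep the top min_keep clusters.
        return ranked[:min_keep]
    passing = [c for c in ranked if c["coverage"] >= min_coverage]
    # Assumed fallback: if the threshold is too strict, keep the top min_keep.
    return passing if len(passing) >= min_keep else ranked[:min_keep]
```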
Next, MITNANEX takes the representative read of each selected cluster and computes its trinucleotide composition (codon-sized k-mers), which is reduced and normalized by the read length. The dimensionality is then reduced to 2 with a PCA, as in the classic strategy used in metagenomic binning. Given the differences between the mitochondrial and nuclear genomes, we expect the mitochondrial reads to have an oligonucleotide composition different enough to separate them from the nuclear reads. K-means is known to be sensitive to outliers, and that weakness is what made this clustering algorithm attractive here. Thus, after running K-means with k set to 2, the cluster with the highest coverage is selected. In the figure below, the cluster in yellow was selected.
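The composition step can be sketched in plain Python. This is an illustration under stated assumptions, not the code MITNANEX runs: it counts all 64 overlapping trinucleotides and divides by the read length, and it does not reproduce the "reduction" mentioned above because the text does not specify it. The names `TRINUCLEOTIDES` and `trinucleotide_profile` are hypothetical.

```python
from itertools import product

# All 64 trinucleotides in a fixed order, so every read yields a comparable vector.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_profile(sequence):
    """Return 64 overlapping trinucleotide counts divided by the read length."""
    sequence = sequence.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    for i in range(len(sequence) - 2):
        kmer = sequence[i:i + 3]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    length = max(len(sequence), 1)  # avoid division by zero on empty input
    return [counts[k] / length for k in TRINUCLEOTIDES]
```

The resulting vectors, one per representative read, could then be passed to scikit-learn's `PCA(n_components=2)` and `KMeans(n_clusters=2)` to obtain the two groups described above.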
With the reads collected from the selected clusters, miniasm assembles unitigs, which we expect to cover most of the mitogenome (small repeats can be resolved at this step if enough coverage is available). These unitigs are mapped against the reads again, and Flye is then used for polishing, which reduces the bias introduced by ONT and by the unitig structure. The result is a draft of the mitochondrial genome of T. santanderensis. This is the draft that is currently published.
More steps could be added, such as circularization or polishing with Illumina data, but they are not essential for the purpose of this software and are out of its scope.
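miniasm writes its unitigs in GFA format, so they have to be converted to FASTA before reads can be mapped to them. The sketch below shows that conversion in plain Python for illustration only. In the pipeline itself this is presumably handled by one of the listed dependencies (gfastats can do such a conversion), and the function name `gfa_to_fasta` is hypothetical.

```python
def gfa_to_fasta(gfa_lines):
    """Turn GFA segment (S) lines into FASTA text, skipping segments without sequence."""
    records = []
    for line in gfa_lines:
        cols = line.rstrip("\n").split("\t")
        # S lines hold: record type, segment name, sequence, optional tags.
        if len(cols) >= 3 and cols[0] == "S" and cols[2] != "*":
            records.append(f">{cols[1]}\n{cols[2]}")
    return "\n".join(records) + ("\n" if records else "")
```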
Clone this repository, run setup.sh, and then:
conda activate mitnanex
pytest