MITNANEX

MITochondrial NANopore reads EXtractor

🧐 About

MITNANEX's main purpose is to extract mitochondrial Nanopore reads de novo from WGS data, with no need for seeds or reference sequences. It also returns a draft assembly of the mitogenome using Flye.

🏁 Getting Started

Installing

First, clone this repository and add it to your PATH:

git clone https://github.com/juanjo255/MITNANEX.git; cd MITNANEX; export PATH=$(pwd):$PATH

Conda/mamba

The best way to install MITNANEX's dependencies is through a conda/mamba environment. First, you must have Rust installed (https://www.rust-lang.org/tools/install).

For Mac M1 using mamba (you can change it for conda):

CONDA_SUBDIR=osx-64 mamba create -n mitnanex -c conda-forge -c bioconda seqkit seqtk fpa minimap2 miniasm flye gfastats samtools filtlong
mamba activate mitnanex
pip install pandas maturin biopython scikit-learn utils-mitnanex

If you run into problems with the pip package utils-mitnanex, build it from source instead:

pip uninstall utils-mitnanex
cd src/utils_rs; maturin develop

For Linux:

conda create -n mitnanex -c conda-forge -c bioconda seqkit seqtk fpa minimap2 miniasm flye gfastats samtools filtlong
conda activate mitnanex
pip install pandas maturin biopython scikit-learn utils-mitnanex

Dependencies

MITNANEX needs the following tools:

  1. Seqkit
  2. Seqtk
  3. fpa
  4. Minimap2
  5. Miniasm
  6. Flye
  7. Pandas
  8. Gfastats
  9. Samtools
  10. Filtlong
  11. Maturin
  12. Biopython
  13. scikit-learn
  14. utils-mitnanex

🎈 Usage

Quick start:

./mitnanex_cli.sh -i path/to/fastQ -p 15000 -m 1000 -t 8 -s 0.6 -g GenomeSize(g|m|k) -w path/to/output

For help message:

./mitnanex_cli.sh -h
Options:
    -i        Input file. [required]
    -t        Threads. [4].
    -p        Proportion. For sampling. It can be a proportion or a number of reads (0.3|10000). [0.3].
    -m        Min-len. Filter reads by minimum length. Read seqkit seq documentation. [-1].
    -M        Max-len. Filter reads by maximum length. Read seqkit seq documentation. [-1].
    -w        Working directory. Path to create the folder which will contain all mitnanex information. [./mitnanex_results].
    -r        Prefix name add to every produced file. [input file name].
    -c        Coverage. Minimum coverage per cluster accepted. [-1].
    -d        Different output directory. Create a different output directory every run (it uses the date and time). [False]
    -s        Mapping identity. Minimum identity between two reads to be stored in the same cluster. [0.6]
    -q        Min mapping quality (>=). This is for samtools. [-1].
    -f        Flye mode. [--nano-hq]
    -g        GenomeSize. This is your best estimation of the mitogenome for read correction with Canu. [required]
    *         Help.

Algorithm overview

How does MITNANEX work?

MITNANEX is a pipeline that depends on other open source tools (see dependencies).

Throughout this overview, the results shown come from the assembly of the Talaromyces santanderensis mitogenome with MITNANEX, using a Nanopore run performed at EAFIT University.

First, MITNANEX uses seqkit and seqtk to subsample the reads. It then runs minimap2 to find overlaps between reads and groups together reads that share at least a given level of identity (tweakable parameter). Each read counts toward the "coverage" of its group, and each cluster is represented only by its longest read.
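
As a rough illustration of this grouping step, the sketch below reads minimap2 overlaps in PAF format and merges reads with a union-find structure. It is not MITNANEX's actual implementation: the function name `cluster_reads`, the use of union-find, and the identity estimate (matching bases divided by alignment block length, PAF columns 10 and 11) are all assumptions made for the example.

```python
from collections import defaultdict


def cluster_reads(paf_lines, min_identity=0.6):
    """Group reads linked by overlaps whose identity is >= min_identity.

    Hypothetical helper: identity is approximated as
    matching bases / alignment block length (PAF columns 10 and 11).
    """
    parent = {}
    length = {}

    def find(read):
        # Walk up to the group's root, shortening the path on the way.
        while parent[read] != read:
            parent[read] = parent[parent[read]]
            read = parent[read]
        return read

    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # not a complete PAF record
        query, qlen, target, tlen = f[0], int(f[1]), f[5], int(f[6])
        matches, block = int(f[9]), int(f[10])
        for name, n in ((query, qlen), (target, tlen)):
            parent.setdefault(name, name)
            length[name] = n
        if query != target and block > 0 and matches / block >= min_identity:
            parent[find(query)] = find(target)

    groups = defaultdict(list)
    for read in parent:
        groups[find(read)].append(read)

    clusters = [
        {
            # The longest read represents the cluster.
            "representative": max(members, key=lambda r: (length[r], r)),
            # Every member read counts once toward the cluster's coverage.
            "coverage": len(members),
            "reads": sorted(members),
        }
        for members in groups.values()
    ]
    return sorted(clusters, key=lambda c: c["coverage"], reverse=True)
```

Reads whose overlaps all fall below the identity threshold end up as single-read clusters with a coverage of 1.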

Once all reads are grouped, MITNANEX keeps only the groups with the highest coverage, at least 3 of them (tweakable parameter). Given the short length of the mitochondrial genome and its high coverage in WGS data, we expect most of it to be contained in these clusters.
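
The coverage filter can be pictured with the minimal sketch below. The function `keep_top_clusters` and its parameters are hypothetical, and the way a minimum coverage threshold interacts with the "at least 3" rule is an assumption of this example, not documented behaviour of MITNANEX.

```python
def keep_top_clusters(coverages, min_clusters=3, min_coverage=-1):
    """Return cluster ids ranked by decreasing coverage.

    Assumed behaviour (not taken from MITNANEX itself):
    - with no threshold (min_coverage < 0), keep the top `min_clusters`;
    - with a threshold, keep every cluster that passes it, but never
      fewer than `min_clusters` when that many clusters exist.
    """
    ranked = sorted(coverages, key=lambda cid: (-coverages[cid], str(cid)))
    if min_coverage < 0:
        return ranked[:min_clusters]
    passing = [cid for cid in ranked if coverages[cid] >= min_coverage]
    return passing if len(passing) >= min_clusters else ranked[:min_clusters]
```

For example, with coverages `{"a": 50, "b": 3, "c": 40, "d": 2, "e": 35}` and the defaults, clusters `a`, `c` and `e` are kept.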

Cluster filter by coverage

With the selected clusters, MITNANEX takes the representative read of each cluster and computes its trinucleotide (codon-length) composition, which is reduced and normalized by read length. It then reduces the dimensionality to 2 with a PCA, following the classic strategy used in metagenomic binning. Given the differences between the mitochondrial and nuclear genomes, we expect the mitochondrial reads to have an oligonucleotide composition different enough to be separated from the nuclear ones. K-means is known to be sensitive to outliers, and that weakness is what made this clustering algorithm attractive here. Thus, the reads are clustered with K-means, with k set to 2, and the cluster with the highest coverage is selected. In the figure below, the selected cluster is shown in yellow.
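
The composition and clustering step can be sketched as follows. This is an illustration under stated assumptions rather than the code MITNANEX runs: the function names are hypothetical, all 64 trinucleotides are counted without any further reduction, and "highest coverage" is interpreted as the largest summed coverage of the member reads. The scikit-learn calls (`PCA`, `KMeans`) are the standard ones from that library, which is among the listed dependencies.

```python
from itertools import product

# All 64 trinucleotides in a fixed order, so vectors are comparable.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_composition(seq):
    """Return 64 overlapping 3-mer counts, each divided by the read length."""
    seq = seq.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
    n = len(seq) or 1  # avoid dividing by zero on an empty read
    return [counts[k] / n for k in TRINUCLEOTIDES]


def select_mito_cluster(vectors, coverages):
    """Project to 2-D with PCA, split with K-means (k=2), and return the
    indices of the reads in the cluster with the larger summed coverage.
    """
    # Imported here so the composition helper works without scikit-learn.
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    points = PCA(n_components=2).fit_transform(vectors)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    totals = [0, 0]
    for label, cov in zip(labels, coverages):
        totals[label] += cov
    best = 0 if totals[0] >= totals[1] else 1
    return [i for i, label in enumerate(labels) if label == best]
```

A typical call would build `vectors` by applying `trinucleotide_composition` to each representative read and pass the matching cluster coverages as the second argument.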

Kmeans on PCA

With the reads collected from the selected clusters, miniasm assembles unitigs, which we expect to cover most of the mitogenome (small repeats can be resolved at this step if enough coverage is available). These unitigs are then mapped against the reads again and Flye is used for polishing, reducing the bias introduced by ONT and by the unitig structure. The result is a draft of the mitochondrial genome of T. santanderensis. This is the draft that is currently published.
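
One practical detail of this stage is that miniasm writes its unitigs in GFA format, while mapping and polishing tools expect FASTA. The sketch below shows what that conversion involves. The helper `gfa_to_fasta` is hypothetical and only illustrative; MITNANEX may well delegate this to one of its listed dependencies, such as gfastats.

```python
def gfa_to_fasta(gfa_lines):
    """Convert the segment (S) lines of a GFA file into FASTA text.

    Each S line has the form: S <name> <sequence> [optional tags].
    Segments whose sequence is '*' (not stored) are skipped.
    """
    records = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "S" and fields[2] != "*":
            records.append(f">{fields[1]}\n{fields[2]}")
    return "\n".join(records) + ("\n" if records else "")
```

Passing an open file handle works as well as a list of lines, for example `gfa_to_fasta(open("unitigs.gfa"))`, where the file name is a placeholder.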

T. santanderensis draft

More steps could be added, such as circularization or polishing with Illumina data, but they are not essential to the purpose of this software and are out of its scope.

⛏️ Testing

Clone this repository, run setup.sh and then:

conda activate mitnanex
pytest