==========
User Guide
==========


Overview
========

You can run **Find Recombinations Among Genomes** using the standalone version called:

.. code-block:: none

    frags

You can obtain help by using:

.. code-block:: none

    frags --help


Installation
============

From pip
--------

The suggested way of installing the latest **FRAGS** version is through **pip**:

.. code-block:: none

    pip3 install frags

Then you can use:

.. code-block:: none

    frags --help

From source code
----------------

**FRAGS** is coded in Python. To manually install it from source, get the source and install **FRAGS** using:

.. code-block:: none

    git clone https://gitlab.pasteur.fr/nmaillet/frags/
    cd frags
    python setup.py install

Using without installation
--------------------------

You can download the source code from Pasteur's **Gitlab**: https://gitlab.pasteur.fr/nmaillet/frags/.

In order to directly run **FRAGS** from source, you need to copy the file ``tests/context.py`` into the ``frags`` folder.

Then, uncomment line 14 of ``frags/FindRecombinationsAmongGenomes.py``. Modify:

.. code-block:: python

    #from context import frags

To:

.. code-block:: python

    from context import frags

Then, from the main **FRAGS** directory, use:

.. code-block:: none

    python3 frags/FindRecombinationsAmongGenomes.py --help

.. warning:: Using without installation is not recommended, as you need all the requirements listed in ``requirements.txt`` installed locally, and you may encounter issues with Sphinx autodoc or other unwanted behaviors.


Classical use
=============

Here are some typical examples of **FRAGS** usage.

Getting help
------------

To access built-in help, use:

.. code-block:: none

    frags --help


Find recombinations with one reference genome
---------------------------------------------

To find all recombinations of a reads file, compared to a single reference genome, use:

.. code-block:: none

    frags -i read_file.fasta -r ref_file.fasta


Find recombinations with two reference genomes
----------------------------------------------

To find all recombinations of a reads file, compared to two reference genomes, use:

.. code-block:: none

    frags -i read_file.fasta -r ref_file.fasta ref_file2.fasta


Using multi fasta/fastq files
-----------------------------

To find all recombinations of several reads files, use:

.. code-block:: none

    frags -i read_file1.fasta read_file2.fasta -r ref_file.fasta ref_file2.fasta


Using Blast to analyze breakpoints
----------------------------------

To perform a Blast on identified breakpoints against the host genome, use:

.. code-block:: none

    frags -i read_file.fasta -r ref_file.fasta -b -t host_file.fasta

See :ref:`blast` for more information.


Working principle of FRAGS
==========================

FRAGS is a Python tool that takes fasta/fastq files of reads and one or two reference genomes as input. It then identifies chimeric reads (reads composed of non-adjacent fragments, or matches, coming from either one or both reference genomes) and potential breakpoints (inserts between matches). Optionally, breakpoints can then be Blasted against the host genome, if provided.


.. _kmerconcept:

K-mer concept
-------------

Technically, FRAGS identifies similarities (matches) between a fragment of a read and a reference using k-mers. A k-mer is a short subsequence of ``k`` nucleotides of the read. Each read is decomposed into overlapping k-mers and each k-mer is searched for in the references. The smaller a k-mer is, the more likely it is to be found in the reference. But using really small k-mers (i.e. < 20) leads to finding them in the reference 'by chance'. K-mer size should usually be greater than 25 to ensure robust results.

Note that the search is performed using the Aho-Corasick algorithm, which enables fast and error-free identification. K-mers that match the reference and are contiguous (modulo :ref:`gap`) in both the reference and the read are merged together.
Because of the way the Aho-Corasick algorithm is used, **each k-mer of a reference should be unique**. If a k-mer is present more than once in a reference, **only the first occurrence of this k-mer is used in the subsequent comparisons**. Fortunately, this situation mostly occurs in low-complexity parts of the genome, such as poly(A) tails or GA-dinucleotide repeats.
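
The decomposition and lookup described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses a plain dictionary instead of an Aho-Corasick automaton, ignores reverse complements, and the names ``kmers``, ``index_reference`` and ``find_hits`` are hypothetical, not part of FRAGS.

.. code-block:: python

    def kmers(seq, k):
        """Yield (position, k-mer) for every overlapping k-mer of seq."""
        for pos in range(len(seq) - k + 1):
            yield pos, seq[pos:pos + k]


    def index_reference(ref, k):
        """Map each k-mer of ref to the position of its FIRST occurrence."""
        index = {}
        for pos, kmer in kmers(ref, k):
            index.setdefault(kmer, pos)  # later duplicates are ignored
        return index


    def find_hits(read, index, k):
        """Return (read_pos, ref_pos) for each k-mer of read found in the index."""
        return [(pos, index[kmer]) for pos, kmer in kmers(read, k) if kmer in index]


    # Toy sequences and a tiny k, for readability only
    ref = "ACGTACGTTTGACCA"
    read = "GGACGTTTGA"
    hits = find_hits(read, index_reference(ref, 5), 5)
    print(hits)  # [(2, 4), (3, 5), (4, 6), (5, 7)]

The four hits are contiguous in both the read and the reference, so they would be merged into a single match.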


Matches
-------
A **match** is a fragment of the read that is almost identical to a fragment of a reference (modulo :ref:`gap`). All matches are written in result files, with their starting position and size (see :ref:`outputfolder` for more details).

Note that each match is unique, and that no match can be included in another one. Accordingly, when a match is found on a read, if the exact same fragment was already found elsewhere (different position of the same reference or into the other reference), only the first match is kept.

| A read composed as: ``XXXXXXXXXXXXyyyyyyyyyyyyyyyyyy``, with ``X`` matching on reference 1 and 2, and ``y`` not matching, will produce the result:

| ``XXXXXXXXXXXX`` is matching on reference 1
| ``yyyyyyyyyyyyyyyyyy`` is not matching

In the same way, if a match is included in a bigger one, only the biggest is kept.

| A read composed as: ``XXXXxxxxxXXXXyyyyyyyyyyyyyyyyyy``, with ``xxxxx`` matching on reference 1, ``XXXXxxxxxXXXX`` on reference 2, and ``y`` not matching, will produce the result:

| ``XXXXxxxxxXXXX`` is matching on reference 2
| ``yyyyyyyyyyyyyyyyyy`` is not matching
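
A minimal sketch of these two rules, assuming candidate matches are given as hypothetical ``(start, end, ref_id)`` tuples in read coordinates (this is not necessarily how FRAGS represents them internally):

.. code-block:: python

    def filter_matches(candidates):
        """Keep only matches that are not included in an already kept one.

        candidates: list of (start, end, ref_id), with end exclusive.
        Longer matches are considered first; among identical fragments,
        the first one encountered wins.
        """
        kept = []
        # Stable sort: longest first, input order preserved among equal lengths
        for start, end, ref_id in sorted(candidates, key=lambda m: m[0] - m[1]):
            included = any(s <= start and end <= e for s, e, _ in kept)
            if not included:
                kept.append((start, end, ref_id))
        return sorted(kept)


    # Same fragment found on both references: only the first is kept
    print(filter_matches([(0, 12, 1), (0, 12, 2)]))  # [(0, 12, 1)]
    # Match on reference 1 included in a bigger match on reference 2
    print(filter_matches([(4, 9, 1), (0, 13, 2)]))   # [(0, 13, 2)]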


Breakpoints
-----------

A **breakpoint** is what stands between two matches, where a recombination probably occurred. Breakpoints can be of three types, depending on their size.


Empty breakpoint
................
A breakpoint of size 0 occurs when two contiguous fragments of a read match at different locations.

A read composed as: ``XXXXXXXXXXXXyyyyyyyyyyyyyyyyyy``, with ``XXXXXXXXXXXX`` of size 12 (from nucleotides 1 to 12) matching on reference 1 and ``yyyyyyyyyyyyyyyyyy`` on reference 2 (or at a different location in ref1) will produce the following breakpoint:

| 12:0 i.e. the breakpoint starts after the nucleotide at position 12, for a size of 0 nucleotides.


Positive breakpoint
...................
A breakpoint of positive size occurs when two non-contiguous fragments of a read match.

A read composed as: ``XXXXXXXXXXXXyyyyyyyZZZZZZZZZZ``, with ``XXXXXXXXXXXX`` of size 12 (from 1 to 12) matching on a reference, ``yyyyyyy`` (7 nucleotides) not matching on any reference and ``ZZZZZZZZZZ`` matching on a reference will produce the following breakpoint:

| 12:7 i.e. the breakpoint starts after the nucleotide at position 12 and is composed of the 7 **following** nucleotides.

.. _negbp:

Negative breakpoint
...................
A breakpoint of negative size occurs when two overlapping fragments of a read match at different locations.

A read composed as: ``XXXXXXXXXXXXyyyyyyyZZZZZZZZZZ``, with ``XXXXXXXXXXXXyyyyyyy`` of size 19 (from 1 to 19) matching at a location and ``yyyyyyyZZZZZZZZZZ`` (17 nucleotides) matching at a different location will produce the following breakpoint:

| 19:-7 i.e. the breakpoint starts after the nucleotide at position 19 and is composed of the 7 **preceding** nucleotides.

Breakpoints can be further investigated using Blast.


.. _blast:

Blast analysis
--------------
Breakpoints can then be Blasted against the host genome where the recombination happened. To do so, the host genome must be provided to FRAGS (-t option) and the Blast option switched on (-b option).

Then, each breakpoint whose size is at least the value of the -m option is locally Blasted against the host genome.

Three files are produced when using Blast: ``breakpoints.fasta``, ``res_blast.csv`` and ``compressed.fasta``. See :ref:`outputfolder` for more details about results files.

.. warning::
    Blast in command line must be available on your computer. See `Blast installation page <https://www.ncbi.nlm.nih.gov/books/NBK279671/>`_.

.. warning::
    A Blast database is required to perform local Blast. This database must be at the same location as the host genome file. If it is not found, it will be automatically created at this location.
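
Before running FRAGS with the -b option, you may want to check that Blast is reachable from the command line. The sketch below is only a suggestion: it assumes that the relevant BLAST+ executables are ``blastn`` and ``makeblastdb``, which is not confirmed by this guide, and ``blast_available`` is a hypothetical helper, not a FRAGS function.

.. code-block:: python

    import shutil


    def blast_available(programs=("blastn", "makeblastdb")):
        """Return True if every listed program is found on the PATH."""
        # Assumption: these are the executables needed; adapt the tuple if not
        return all(shutil.which(prog) is not None for prog in programs)


    if not blast_available():
        print("Blast does not seem to be installed or is not on the PATH")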


.. _outputfolder:

Output of FRAGS
---------------
Main results are in CSV files. Three CSV files contain, respectively, the results for reads whose matches come from the first reference only, from the second reference only, or from both references. Another CSV file contains the headers of reads that did not match anywhere. Finally, three files are produced when Blast is used: ``breakpoints.fasta``, containing the breakpoint fragments to Blast; ``res_blast.csv``, containing the result of Blast on ``breakpoints.fasta``; and ``compressed.fasta``, a compressed, fasta-like version of ``res_blast.csv`` that keeps only the hits with the best e-value/bit-score for each input sequence.

.. note::
    ``breakpoints.fasta`` and ``compressed.fasta`` are fasta files, regardless of input files being fasta or fastq files. The starting character of headers in these files is always ``>``.

In ``compressed.fasta``, headers are created as follows: ``OriginalHeader_IdOfBreakpoint|E-value|Bit-score``. ``IdOfBreakpoint`` is the identifier of the breakpoint in the read. If there are three breakpoints in a read, identifiers will be ``1``, ``2`` and ``3``. ``E-value`` and ``Bit-score`` are respectively the e-value and the bit-score of the best hit(s) for this breakpoint.
The sequence line is composed of all different Blast hits with the same e-value and bit-score, separated by tabulations. Tabulation and ``|`` are configurable through :ref:`csv`.
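
As an illustration, such a header could be split back into its components as follows. This is a sketch assuming the default ``|`` separator and numeric fields that Python's ``float`` can read; ``parse_compressed_header`` and the sample header are hypothetical.

.. code-block:: python

    def parse_compressed_header(line, fieldsep="|"):
        """Split a compressed.fasta header into its four components."""
        text = line.rstrip("\n").lstrip(">")
        name_and_id, evalue, bitscore = text.rsplit(fieldsep, 2)
        # The original header may itself contain "_": split on the last one
        header, bp_id = name_and_id.rsplit("_", 1)
        return header, int(bp_id), float(evalue), float(bitscore)


    print(parse_compressed_header(">read_name_2|1e-20|95.3"))
    # ('read_name', 2, 1e-20, 95.3)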

.. warning::
    Because of Blast limitation, headers of input reads must not contain any spaces. FRAGS will replace all spaces in headers by ``_`` symbol.

The CSV files are composed of 9 columns:

* ``Read_header`` contains header of the input read that matches (see warning below)
* ``Reverse_complement`` indicates if all matches of this read are in normal strand (0), in reverse complement (1) or some in normal strand and some in reverse complement (2)
* ``Number_of_breakpoints`` indicates the total number of breakpoints identified in this read
* ``Breakpoints_positions`` indicates the position and size of each identified breakpoint of this read (see :ref:`possize`)
* ``Matches_read_positions`` indicates the position and size (in the read) of each identified match of this read (see :ref:`possize`). Each match is preceded by the strand in parentheses, i.e. ``(-)`` for reverse complement or ``(+)`` for normal strand, and the id in parentheses of the reference where it matches, i.e. ``(1)`` for the first inputted reference genome or ``(2)`` for the second inputted reference genome
* ``Matches_ref_positions`` indicates the position and size (in the ref) of each identified match of this read (see :ref:`possize`). Each match is preceded by the id in parentheses of the reference where it matches, i.e. ``(1)`` for the first inputted reference genome or ``(2)`` for the second inputted reference genome
* ``Matches_size`` indicates all sizes of all matches of this read
* ``Insertions`` indicates sizes of insertions that occur in matches of this read. Note that a single match can have several insertions. They are then separated by ``:`` (configurable through :ref:`csv`).
* ``Blast_results_breakpoints`` indicates which breakpoints of this read have a Blast hit

.. warning::
    Because of Blast limitation, headers of input reads must not contain any spaces. FRAGS will replace all spaces in headers by ``_`` symbol.


.. _possize:

Nomenclature of positions
.........................
The nomenclature used to represent a match or a breakpoint is the following:
``X:Y`` where ``X`` is the index **before** the starting nucleotide, ``Y`` the size of the match/breakpoint and ``:`` the configurable separator (see :ref:`csv` for configuration).

A read composed as: ``XXXXXXXXXXXXyyyyyyyZZZZZZZZZZ``, with ``XXXXXXXXXXXX`` of size 12 (from 1 to 12) matching on a reference, ``yyyyyyy`` of size 7 (from 13 to 19) not matching on any reference and ``ZZZZZZZZZZ`` of size 10 (from 20 to 29) matching on a reference will produce the following matches/breakpoint:

| ``Match 0:12`` i.e. a match that starts after the nucleotide at position 0 and is composed of the 12 following nucleotides.

| ``Breakpoint 12:7`` i.e. a breakpoint that starts after the nucleotide at position 12 and is composed of the 7 following nucleotides.

| ``Match 19:10`` i.e. a match that starts after the nucleotide at position 19 and is composed of the 10 following nucleotides.
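
With this nomenclature, the breakpoint between two consecutive matches can be derived from the matches alone. The following sketch is a hypothetical helper (``breakpoints`` is not a FRAGS function) that assumes matches are given as ``(X, Y)`` pairs sorted by position in the read:

.. code-block:: python

    def breakpoints(matches, possep=":"):
        """Compute the breakpoints between consecutive matches of a read.

        matches: list of (index_before_start, size), sorted by position.
        """
        result = []
        for (pos1, size1), (pos2, _) in zip(matches, matches[1:]):
            end1 = pos1 + size1  # position of the last nucleotide of the first match
            # Size is positive for an insert, 0 for adjacent matches,
            # and negative when the two matches overlap
            result.append(f"{end1}{possep}{pos2 - end1}")
        return result


    print(breakpoints([(0, 12), (19, 10)]))  # ['12:7']
    print(breakpoints([(0, 12), (12, 18)]))  # ['12:0']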


.. _example:

Example of a CSV result line
............................

``>read_name   1   2   59:20|114:47    (-)(1)25:34|(-)(2)79:35|(-)(2)161:39    (1)534:34|(2)522:35|(2)336:39 34|35|39  0|3|0 2``

The read ``read_name`` matches only in reverse complement (``1``). It has ``2`` breakpoints, from nucleotide 60 to 79 (``59:20``, starting at nucleotide 60, with a size of 20 nucleotides) and from nucleotide 115 to 161 (``114:47``).

In the read, the first match is in reverse complement (``(-)``) on the first reference genome (``(1)``), from positions 26 to 59 (``25:34``).

In the read, the second match is in reverse complement (``(-)``) on the second reference genome (``(2)``), from positions 80 to 114 (``79:35``).

In the read, the third match is in reverse complement (``(-)``) on the second reference genome (``(2)``), from positions 162 to 200 (``161:39``).

In the references, the first match is on the first reference genome (``(1)``), from positions 535 to 568 (``534:34``).

In the references, the second match is on the second reference genome (``(2)``), from positions 523 to 557 (``522:35``).

In the references, the third match is on the second reference genome (``(2)``), from positions 337 to 375 (``336:39``).

The first match has a length of 34 nucleotides, the second 35 and the third 39 (``34|35|39``).

The first match has no insertion, the second one has an insertion of 3 nucleotides, and the third one has none (``0|3|0``).

The second breakpoint had a hit with Blast (``2``).

.. note::
    In CSV files, symbols ``:``, ``|`` and tabulation are configurable. See :ref:`csv` for this.
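
A result line such as the one above can be loaded with a few lines of Python. The sketch below assumes the default separators and the column order listed in :ref:`outputfolder`; ``parse_line`` is a hypothetical helper, and whether the files start with a header row is not covered here.

.. code-block:: python

    COLUMNS = [
        "Read_header", "Reverse_complement", "Number_of_breakpoints",
        "Breakpoints_positions", "Matches_read_positions",
        "Matches_ref_positions", "Matches_size", "Insertions",
        "Blast_results_breakpoints",
    ]


    def parse_line(line, csvsep="\t", fieldsep="|", possep=":"):
        """Turn one result line into a dict; breakpoints become (pos, size)."""
        row = dict(zip(COLUMNS, line.rstrip("\n").split(csvsep)))
        row["Breakpoints_positions"] = [
            tuple(int(x) for x in bp.split(possep))
            for bp in row["Breakpoints_positions"].split(fieldsep) if bp
        ]
        row["Matches_size"] = [
            int(x) for x in row["Matches_size"].split(fieldsep) if x
        ]
        return row


    # The example line, rebuilt with explicit tabulations
    line = "\t".join([
        ">read_name", "1", "2", "59:20|114:47",
        "(-)(1)25:34|(-)(2)79:35|(-)(2)161:39",
        "(1)534:34|(2)522:35|(2)336:39", "34|35|39", "0|3|0", "2",
    ])
    row = parse_line(line)
    print(row["Breakpoints_positions"])  # [(59, 20), (114, 47)]
    print(row["Matches_size"])           # [34, 35, 39]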


Options
=======

Here are all available options in **FRAGS**:

Default options
---------------

**-h, -\-help**: Show this help message and exit.

**-i, -\-inputfiles**: Input reads files. See :ref:`inputfiles` for more information.

**-r, -\-reffiles**: Input reference files. See :ref:`reffiles` for more information.

**-o, -\-outputfolder**: Output folder containing result files. See :ref:`outputfolder` for more information.

**-k, -\-kmer**: K-mer length of the search (default: 30). See :ref:`kmer` for more information.

**-g, -\-gap**: Gap length to consider not contiguous hits as contiguous (default: 10). See :ref:`gap` for more information.

**-p, -\-processes**: Number of parallel processes to use (default: 1). See :ref:`processes` for more information.

**-q, -\-quiet**: No standard output, only error(s).

**-v, -\-verbose**: Increase output verbosity. See :ref:`verbose` for more information.

**-\-version**: Show program's version number and exit.

Blast options
-------------

See :ref:`blast` for more detailed information.

**-b, -\-blast**: Use Blast to analyze breakpoints greater than -m/-\-minsizeblast argument.

**-t, -\-host**: Host genome file. Required if -b/-\-blast argument is used. Note: a Blast database will be created at this location.

**-m, -\-minsizeblast**: Minimum size of breakpoint to Blast (default: 20).

CSV options
-----------

See :ref:`csv` for more detailed information.

**-s, -\-posizesep**: Separator between position and size (default: :).

**-f, -\-fieldsep**: Field separator inside columns (default: \|).

**-c, -\-csvsep**: Column separator (default: \\t).


Detailed options
-----------------

.. _inputfiles:

Input files
...........

Input files should be fasta or fastq files. They can be compressed in gzip.

To input several files, use:

.. code-block:: none

    frags -i read_file1.fastq read_file2.fasta read_file3.fastq.gz ...

This is equivalent to first merging all input files into a single one.


.. _reffiles:

Reference files
...............

Reference genome files should be fasta or fastq files. They can be compressed in gzip.

You can input one or two reference genomes.

.. code-block:: none

    frags -i ... -r ref1.fastq ref2.fasta.gz ...

.. note::
    Genomes must be composed of a single sequence. If a reference file is composed of more than one sequence, only the first one will be used.


.. _kmer:

K-mers length
.............

Size of the k-mers used to find similarities (matches) between a read and a reference. Note that the k-mer size should ideally be greater than 25 to ensure robust results. See :ref:`kmerconcept` to understand how similarities are found and used.


.. _gap:

Gap length
..........

The gap option allows non-contiguous sub-parts of a read to be considered as contiguous if the number of nucleotides separating them is at most the value of the -\-gap option.

| A read composed as: ``XXXXXXXXXXXXyyyXXXXXXXXXXXXXXyyyyyXXXXXXXX``, with positions ``X`` matching on the reference and ``y`` not matching, will produce the result (using gap at 3):

| ``XXXXXXXXXXXXyyyXXXXXXXXXXXXXX`` is matching
| ``yyyyy`` is not matching
| ``XXXXXXXX`` is matching

It will produce the result (using gap option at 2):

| ``XXXXXXXXXXXX`` is matching
| ``yyy`` is not matching
| ``XXXXXXXXXXXXXX`` is matching
| ``yyyyy`` is not matching
| ``XXXXXXXX`` is matching
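
The merging rule illustrated above can be sketched as follows. This is a simplified illustration that only looks at read coordinates (FRAGS also requires contiguity in the reference), and ``merge_with_gap`` is a hypothetical name. In line with the example, a stretch of exactly ``gap`` non-matching nucleotides is bridged.

.. code-block:: python

    def merge_with_gap(hits, gap):
        """Merge hits separated by at most `gap` non-matching nucleotides.

        hits: sorted list of (start, end) in read coordinates, end exclusive.
        """
        if not hits:
            return []
        merged = [hits[0]]
        for start, end in hits[1:]:
            last_start, last_end = merged[-1]
            if start - last_end <= gap:
                merged[-1] = (last_start, end)  # bridge the gap
            else:
                merged.append((start, end))
        return merged


    # 12 X, 3 y, 14 X, 5 y, 8 X, as in the read above
    hits = [(0, 12), (15, 29), (34, 42)]
    print(merge_with_gap(hits, 3))  # [(0, 29), (34, 42)]
    print(merge_with_gap(hits, 2))  # [(0, 12), (15, 29), (34, 42)]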


.. _processes:

Multi-process
.............

FRAGS can be launched as a single-core process (default behavior) or on many cores at the same time in order to speed up some parts of the computation.

Using 1 process (default):

.. code-block:: none

    $ frags -i reads.fastq -r ref1.fasta ref2.fasta -o res -b -t host.fna
    Query time: 19.16s
    Write file to be Blasted: 0.03s
    Blast time: 0.78s
    Write compressed Blast results: 0.00s
    Write results files: 0.06s
    Total time: 20.06s

Using 2 processes:

.. code-block:: none

    $ frags -i reads.fastq -r ref1.fasta ref2.fasta -o res -b -t host.fna -p 2
    Query time: 10.35s
    Write file to be Blasted: 0.03s
    Blast time: 0.69s
    Write compressed Blast results: 0.00s
    Write results files: 0.06s
    Total time: 11.15s

Using 4 processes:

.. code-block:: none

    $ frags -i reads.fastq -r ref1.fasta ref2.fasta -o res -b -t host.fna -p 4
    Query time: 5.79s
    Write file to be Blasted: 0.03s
    Blast time: 0.61s
    Write compressed Blast results: 0.00s
    Write results files: 0.06s
    Total time: 6.50s

Using 8 processes:

.. code-block:: none

    $ frags -i reads.fastq -r ref1.fasta ref2.fasta -o res -b -t host.fna -p 8
    Query time: 5.52s
    Write file to be Blasted: 0.03s
    Blast time: 0.62s
    Write compressed Blast results: 0.00s
    Write results files: 0.06s
    Total time: 6.26s

Note that some parts of the computation are not parallelized, so increasing the number of processes will not speed up the computation beyond a certain point.
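
To make this concrete, the speed-up obtained in the runs above can be computed from the reported total times. The snippet only reuses those figures; they come from one example dataset and will differ on other data or hardware.

.. code-block:: python

    # Total time in seconds for each number of processes, from the runs above
    total_time = {1: 20.06, 2: 11.15, 4: 6.50, 8: 6.26}

    # Speed-up relative to the single-process run
    speedups = {p: round(total_time[1] / t, 2) for p, t in total_time.items()}

    for processes, speedup in speedups.items():
        print(f"{processes} process(es): x{speedup:.2f}")
    # 1 process(es): x1.00
    # 2 process(es): x1.80
    # 4 process(es): x3.09
    # 8 process(es): x3.20

Going from 4 to 8 processes brings almost no gain here, as the query time barely changes (5.79s to 5.52s).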


.. _verbose:

Verbosity
.........

Verbosity can be increased or decreased. The output file is not affected by **-v** or **-q** options.

With default verbosity level (no **-v** nor **-q** option), the output is:

.. code-block:: none

    $ frags -i reads.fastq -r ref1.fasta ref2.fasta -o res -p 8 -b -t host.fna
    Total time: 6.61s
    $

Increasing verbosity, *i.e.* using **-v**, adds information about time. For example:

.. code-block:: none

    $ frags -i reads.fastq -r ref1.fasta ref2.fasta -o res -p 8 -b -t host.fna -v
    Query time: 5.56s
    Write file to be Blasted: 0.03s
    Blast time: 0.61s
    Write compressed Blast results: 0.00s
    Write results files: 0.06s
    Total time: 6.28s
    $

Decreasing verbosity, *i.e.* using **-q** option, removes all information but errors. For example:

.. code-block:: none

    $ frags -i reads.fastq -r ref1.fasta ref2.fasta -o res -p 8 -b -t host.fdna -q
    Error, host file host.fdna not found
    $ frags -i reads.fastq -r ref1.fasta ref2.fasta -o res -p 8 -b -t host.fna -q
    $ 


.. _csv:

CSV configuration of output files
.................................

Main result files are in CSV format. Three separators are configurable.
The first one (**-c**) is the column separator. By default, columns are separated by tabulations.

The second one (**-s**) is the separator between position and size for matches or breakpoints (and potentially also for insertions, see :ref:`outputfolder`). See :ref:`possize` for more information. By default, position and size are separated by ``:``.

The third one (**-f**) is used to separate pieces of information about the different matches in the same cell. By default, they are separated by ``|``.

See :ref:`example` for a better understanding.

.. note::
    In the ``compressed.fasta`` output file, options **-c** and **-f** are used. See :ref:`outputfolder` for this.