2.1. frags package

2.1.1. Submodules

2.1.2. frags.context module

2.1.3. frags.core module

Contains generic functions used by FRAGS

frags.core.build_graph(ref, k)[source]
Index each k-mer of a genome

Aho-Corasick implementation; requires the PyPI package pyahocorasick

Parameters
  • ref (str) – the reference to index

  • k (int) – k-mer size

frags.core.find_hits(graph, a_read)[source]
Find all kmers of ref present in a read

All hits are of the form: start_pos_read: start_pos_ref

Parameters
  • graph (pyahocorasick) – the graph to parse

  • a_read (str) – the read to search in the index
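The indexing and lookup steps described above can be sketched without any dependency. This is only an illustration: a plain dictionary stands in for the Aho-Corasick automaton that FRAGS builds with pyahocorasick, and the names index_kmers and find_kmer_hits (as well as the extra k argument) are hypothetical, not part of frags.core.

```python
from collections import defaultdict


def index_kmers(ref, k):
    """Map each k-mer of ref to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        index[ref[pos:pos + k]].append(pos)
    return index


def find_kmer_hits(index, k, a_read):
    """Return {start_pos_read: [start_pos_ref, ...]} for every k-mer of the read found in the index."""
    hits = {}
    for pos in range(len(a_read) - k + 1):
        kmer = a_read[pos:pos + k]
        if kmer in index:  # membership test does not create new entries
            hits[pos] = list(index[kmer])
    return hits


index = index_kmers("ACGTACGA", 4)
hits = find_kmer_hits(index, 4, "TTACGTAC")
```

An automaton scans the read once for all k-mers at the same time, which is presumably why the real implementation relies on Aho-Corasick rather than on per-position dictionary lookups.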

frags.core.get_all_queries(file, nb_proc, k, gap, graph1, graph2=None)[source]

Launch all parallel processes to get all queries from a file

Parameters
  • file (string) – the filename of the file where to take sequences from

  • nb_proc (int) – number of processes to run in parallel

  • k (int) – size of kmers

  • gap (int) – maximum authorized gap size for continuous hits

  • graph1 – the graph to parse for genome1

  • graph2 – the graph to parse for genome2

frags.core.get_recombinations(offset_start, offset_end, file, k, gap, graph1, graph2=None)[source]
Main parallelized function that retrieves each read in an offset range and finds its matches and breakpoints.

Parameters
  • offset_start (int) – where to start taking sequences in the file

  • offset_end (int) – where to stop taking sequences in the file

  • file (string) – the filename of the file where to take sequences from

  • k (int) – size of kmers

  • gap (int) – maximum authorized gap size for continuous hits

  • graph1 – the graph to parse for genome1

  • graph2 – the graph to parse for genome2
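The documentation does not say how the offset ranges handed to each process are computed. The sketch below shows one simple possibility, assuming offsets are byte positions and that the file is cut into nb_proc contiguous ranges of roughly equal size. The function split_offsets is hypothetical and is not part of frags.core.

```python
def split_offsets(file_size, nb_proc):
    """Cut [0, file_size) into nb_proc contiguous (offset_start, offset_end) ranges."""
    chunk = file_size // nb_proc
    ranges = []
    start = 0
    for i in range(nb_proc):
        # The last range absorbs the remainder of the division
        end = file_size if i == nb_proc - 1 else start + chunk
        ranges.append((start, end))
        start = end
    return ranges


ranges = split_offsets(10, 3)
```

Each (offset_start, offset_end) pair could then be passed to one worker. A real reader also has to realign to the next record boundary when a range starts in the middle of a sequence, which this sketch does not handle.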

frags.core.get_reference(input_file)[source]
Get the reference genome as a single string.

Only take the first sequence of the file. Can be fasta or fastq, gzipped or not.

Parameters

input_file (str) – fasta/fastq file to use as reference
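As an illustration of the behaviour described above, the following sketch reads the first sequence of a FASTA or FASTQ file, gzipped or not, using only the standard library. It is not the FRAGS implementation: open_maybe_gzip and first_sequence are hypothetical names, gzip detection relies on the two-byte gzip magic number, and FASTQ records are assumed to span exactly four lines.

```python
import gzip


def open_maybe_gzip(path):
    """Open path as text, transparently handling gzip-compressed files."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def first_sequence(path):
    """Return the first sequence of a FASTA/FASTQ file as one string."""
    with open_maybe_gzip(path) as handle:
        first = handle.readline()
        if first.startswith(">"):  # FASTA: join lines until the next header
            parts = []
            for line in handle:
                if line.startswith(">"):
                    break
                parts.append(line.strip())
            return "".join(parts)
        if first.startswith("@"):  # FASTQ: sequence is the line after the header
            return handle.readline().strip()
        raise ValueError("not a FASTA or FASTQ file: " + path)
```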

frags.core.next_read(file, offset_start, offset_end)[source]
Return each sequence within an offset range of a file as a tuple (header, seq), using a generator.

Can be fasta or fastq, gzipped or not. WARNING: spaces in headers are replaced by _

Parameters
  • file (str) – fasta/fastq file to read

  • offset_start (int) – offset in the file from where to read

  • offset_end (int) – offset in the file until where to read
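The generator behaviour and the header rewriting can be sketched as follows. This is a simplified illustration that handles FASTA lines only and ignores offsets and compression; iter_fasta is a hypothetical name, not a function of frags.core.

```python
def iter_fasta(lines):
    """Yield (header, seq) tuples from an iterable of FASTA lines.

    Spaces in headers are replaced by underscores, as the warning above describes.
    """
    header, parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header = line[1:].replace(" ", "_")
            parts = []
        elif line:
            parts.append(line)
    if header is not None:  # emit the last record
        yield header, "".join(parts)


records = list(iter_fasta([">read 1 sample", "ACGT", "TT", ">read2", "GG"]))
```

Because the function is a generator, records are produced one at a time and a large file never needs to be held in memory.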

frags.core.prepare_blast_file(breakpoint_file, all_queries, minsizeblast)[source]
Prepare a FASTA file to be Blasted, containing all breakpoints of at least minsizeblast nucleotides.

WARNING: this writes a FASTA file, regardless of the format of the original file.

WARNING: headers of the original file are modified to indicate which breakpoint(s) of a specific read are Blasted: original_header_#bp

Parameters
  • breakpoint_file (string) – the filename of the file to be written

  • all_queries (list(Read)) – all queries that may contain breakpoints

  • minsizeblast (int) – minimal size of breakpoint accepted
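The size filter and the header suffix can be illustrated with a small sketch. It assumes that #bp in original_header_#bp stands for the index of the breakpoint within its read, which the documentation does not spell out. The function breakpoint_fasta_records and the (beg_pos_read, size) tuples are hypothetical stand-ins for the Read and Breakpoint objects.

```python
def breakpoint_fasta_records(header, sequence, breakpoints, minsizeblast):
    """Build FASTA records for breakpoints of at least minsizeblast nucleotides.

    breakpoints is a list of (beg_pos_read, size) tuples (assumed layout).
    """
    records = []
    for bp_id, (beg, size) in enumerate(breakpoints):
        if size >= minsizeblast:
            # Header suffix identifies which breakpoint of the read is written
            records.append(">{}_{}\n{}\n".format(header, bp_id, sequence[beg:beg + size]))
    return records


records = breakpoint_fasta_records("r1", "AAAACCCCGGGGTTTT", [(0, 2), (4, 8)], 4)
```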

frags.core.process_blast_res(compressed_file, res_blast_file, sep, all_breakpoints)[source]
Compress the Blast result to show only the best hits and output the result in a FASTA-like file.

The header is the original header with breakpoint id, e-value and bit-score.

WARNING: this writes a FASTA file, regardless of the format of the original file

Parameters
  • compressed_file (string) – the filename of the file to be written

  • res_blast_file (string) – Blast result file

  • sep (list(char)) – separator to use in the result file

  • all_breakpoints (dict) – dict of Breakpoints/index created before the Blast
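A minimal sketch of the "best hits only" step is shown below. It assumes the Blast result is in the default tabular layout (12 tab-separated columns with e-value and bit-score in the last two) and that "best" means the highest bit-score per query; neither assumption is confirmed by this documentation. The function best_hits is hypothetical.

```python
def best_hits(blast_lines):
    """Keep, for each query, the hit with the highest bit-score.

    Returns {query_id: (subject_id, evalue, bitscore)}.
    """
    best = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # skip comments or malformed lines
            continue
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


lines = [
    "r1_0\tsubjA\t98.0\t50\t1\t0\t1\t50\t10\t59\t1e-20\t90.5",
    "r1_0\tsubjB\t99.0\t50\t0\t0\t1\t50\t10\t59\t1e-25\t99.0",
    "# a comment line",
]
result = best_hits(lines)
```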

frags.core.reverse_complement(seq)[source]

Take an input sequence and return its reverse complement

Parameters

seq (str) – the seq to compute
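For reference, a reverse complement can be computed in a few lines with the standard library. This is a generic sketch, not necessarily how frags.core.reverse_complement is written; the name revcomp is hypothetical, and characters outside ACGT/acgt (such as N) are left unchanged here.

```python
# Translation table mapping each base to its complement
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


result = revcomp("AACG")
```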

frags.core.write_header(output_file, sep='\t')[source]

Write header of CSV output files

Parameters
  • output_file (str) – CSV file to write in

  • sep (char) – Separator to use between CSV columns

2.1.4. frags.read module

Contains class and functions related to read definition and use

class frags.read.Breakpoint(beg_pos_read, size)[source]

Bases: object

Define a breakpoint.

Parameters
  • beg_pos_read (int) – starting position in the read of this match

  • size (int) – size of the match

output(sep)[source]

Proper output of a line in the result file

Parameters

sep (list(char)) – Separator to use in CSV

class frags.read.Match(beg_pos_read, beg_pos_ref, strand, ref, size, inserts, seq_l)[source]

Bases: object

Define a match.

Parameters
  • beg_pos_read (int) – starting position in the read of this match

  • beg_pos_ref (int) – starting position in the ref of this match

  • strand (int) – strand of this match

  • ref (int) – the ref index for this match

  • size (int) – size of the match

  • inserts (list(int)) – size of potential insertions (possible to have several insertions in ONE match)

  • seq_l (int) – size of the read (needed for reverse-complement computation)

is_include_in(other)[source]

Check if this match is included in another match

Parameters

other (Match) – the match to compare with

output_read(sep)[source]

Proper output of read information

Parameters

sep (list(char)) – Separator to use in CSV

output_ref(sep)[source]

Proper output of ref information

Parameters

sep (list(char)) – Separator to use in CSV

class frags.read.Read(header, sequence)[source]

Bases: object

Define a read.

Parameters
  • header (str) – header of the read

  • sequence (str) – sequence of the read

add_a_match(match)[source]
Test if this match should be added or not.

It must be added if it is not a subpart of another match that has already been added. In some cases, matches that were already added are subparts of the match to add. If so, they are removed.

Parameters

match (Match) – the match to add
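The add-or-discard rule described above can be sketched on plain intervals. The example assumes that "subpart" means containment of the read interval (beg_pos_read, size) and ignores strand and reference; the real Match.is_include_in may compare more fields. The functions is_included and add_interval are hypothetical.

```python
def is_included(inner, outer):
    """True if interval inner = (beg, size) lies entirely within outer."""
    return outer[0] <= inner[0] and inner[0] + inner[1] <= outer[0] + outer[1]


def add_interval(intervals, new):
    """Return the list of kept intervals after trying to add new."""
    # Discard new if an already kept interval covers it
    if any(is_included(new, kept) for kept in intervals):
        return list(intervals)
    # Otherwise drop kept intervals that new covers, then add it
    kept = [iv for iv in intervals if not is_included(iv, new)]
    kept.append(new)
    return kept


matches = [(0, 10), (20, 5)]
unchanged = add_interval(matches, (2, 3))
replaced = add_interval(matches, (18, 10))
```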

get_breakpoints()[source]

Populate breakpoints list using all hits, for both strands

get_matches(hits, gap, k, strand, ref)[source]

Populate matches list from all hits, for one strand

Parameters
  • hits (dict) – matching position on read and ref

  • gap (int) – maximum authorized gap size for continuous hits

  • k (int) – k-mer size

  • strand (int) – the strand of this hit

  • ref (int) – the reference index of this hit
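One plausible reading of how hits become matches is sketched below: k-mer hits that lie on the same diagonal (same read position minus ref position) and whose read positions are at most gap + 1 apart are merged into a single match. This is an assumption made for illustration; it ignores insertions, and chain_hits is a hypothetical function, not the body of Read.get_matches.

```python
def chain_hits(hits, k, gap):
    """Merge k-mer hits into matches (beg_pos_read, beg_pos_ref, size).

    hits maps start_pos_read to start_pos_ref.
    """
    matches = []
    current = None  # [beg_pos_read, beg_pos_ref, last_pos_read]
    for read_pos in sorted(hits):
        ref_pos = hits[read_pos]
        same_diagonal = (current is not None
                         and read_pos - ref_pos == current[0] - current[1])
        if same_diagonal and read_pos - current[2] <= gap + 1:
            current[2] = read_pos  # extend the running match
        else:
            if current is not None:
                matches.append((current[0], current[1], current[2] + k - current[0]))
            current = [read_pos, ref_pos, read_pos]
    if current is not None:  # close the last match
        matches.append((current[0], current[1], current[2] + k - current[0]))
    return matches


matches = chain_hits({1: 3, 2: 4, 3: 5, 10: 50, 11: 51}, k=4, gap=0)
```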

get_ref()[source]

Compute the ref of this Read (0=nothing / 1=ref1 / 2=ref2 / 3=ref1 AND ref2)

get_strand()[source]

Compute the strand of this Read (-1=nothing / 0=normal / 1=revcomp / 2=normal AND revcomp)
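The two codes above can be derived from the matches of a read as sketched here. The sketch assumes that each match carries ref equal to 1 or 2 and strand equal to 0 (normal) or 1 (revcomp), which the Match parameters suggest but do not state. The functions ref_code and strand_code are hypothetical.

```python
def ref_code(match_refs):
    """0=nothing / 1=ref1 / 2=ref2 / 3=ref1 AND ref2 (bitwise OR of the refs seen)."""
    code = 0
    for ref in match_refs:
        code |= ref
    return code


def strand_code(match_strands):
    """-1=nothing / 0=normal / 1=revcomp / 2=normal AND revcomp."""
    strands = set(match_strands)
    if not strands:
        return -1
    if strands == {0, 1}:
        return 2
    return strands.pop()


codes = (ref_code([1, 2, 1]), strand_code([0, 1, 0]))
```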

output(sep)[source]

Proper output of a line of the result file

Parameters

sep (list(char)) – Separator to use in CSV

2.1.5. Module contents

Contains everything related to the FRAGS software