Skip to content

dduchen/rs-gfatovcf

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

rs-gfatovcf

rs-gfatovcf is a tool that finds variants in a Variation Graph. Refer to this link for further details.

How to run this tool

First, clone this repo and move to the resulting directory:

git clone https://github.com/HopedWall/rs-gfatovcf
cd rs-gfatovcf

To compile the program, run:

cargo build --release

In order to run the program, use the following command:

cargo run --release -- -i {input_file.gfa} -o {output_file.vcf}

Where:

  • input_file.gfa is the gfa file that will be used to build the graph
  • output_file.vcf is the vcf file that will be used to store the variants

You can also add one or more of the following arguments:

  • -v (or --verbose) to display various debug messages.
  • -p {path_ids} (or --reference-paths {path_ids}) to specify which paths from the GFA should be used as reference. By default, all paths are used.
  • -m {integer} to set the max number of edges to be traversed, default is 100. WARNING: Greater values may require a very large amount of memory!
  • -j {path} (or --json {path}) to store both the starting graph and its bfs-tree in json format.

How it works

GFAtoVCF performs two main tasks during its execution:

  1. Bubble Detection
  2. Variant Identification

Bubble Detection

A bubble consists of multiple directed unipaths (a path in which all internal vertices are of degree 2) between two vertices. Bubbles are commonly caused by a small number of errors in the centre of reads. More information can be found in this paper

Detecting bubbles is important for converting a file from GFA to VCF since Variants will be found exclusively inside bubbles. In order to detect bubbles, we first build a support tree from the starting graph; then, by looking at how many nodes are at the each level of the tree, we can easily understand whether or not a bubble has been found. The approach will now be described in more detail:

  1. Run a BFS on the graph starting from a node X with no incoming edges, by doing so we obtain a tree rooted in X. We chose a BFS instead of a DFS since we wanted the order of the edges to not matter, and BFS achieves exactly that.
  2. Detect, for each level of the tree, how many nodes are at that level, this will provide crucial information for detecting bubbles in the next step.
  3. Detect bubbles by analyzing the data structure obtained in the previous step; basically, we explore the tree level-by-level. The following situations are possible:
    • no bubble has been detected yet, a level with only 1 node is found (and this node has at least 2 outgoing edges) -> OPEN the bubble
    • a bubble has already been detected, a level with multiple nodes is found -> EXTEND the bubble
    • a bubble has already been detected, a level with only 1 node is found -> CLOSE the bubble

Each bubble is represented internally as a tuple of NodeIDs: (Bubble_Start, Bubble_End), this will allow us to easily explore the nodes in the graph inside the bubble. Note that this script is currently unable to detect nested bubbles, also known as superbubbles.

Variant Identification

Variant Identification consists in finding Variants, which are differences in terms of sequences from a reference genome. In the graph, all genomes, including the reference ones, are described as paths. This second steps mostly invoves exploring different paths and comparing them to the references; instead of comparing all possible paths in the graph, we only focus on paths inside each bubble (since Variants can only be found there). The approach will now be described in more detail:

  1. Consider a reference path: remember that a path is a list of nodes, so ref = [r1,r2...rn]
  2. Compute all possible paths between Bubble_Start and Bubble_End: each path will be represented as: path = [p1,p2...pm]
  3. Compare reference path to each possible path. Consider a generic index i s.t. i <= min(n,m):
    • if ref[i] and path[i] are equal -> no variant found
    • if ref[i+1] and path[i] are equal -> DEL
    • if ref[i] and path[i+1] are equal -> INS
    • if ref[i] and path[i] are different -> SNV

This will be repeated for each reference path and for each bubble.

About

Rust implementation of Flavia95's GFAtoVCF

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Rust 100.0%