Find Homologous Contigs By Gene Annotation

python script to identify genes that are present on multiple contigs

Introduction

Genome assemblies of non-inbred diploid individuals may result in the assembly of "homologous contigs", contig pairs that capture homologous genomic regions but are sufficiently divergent from each other as to be assembled independently. This phenomenom often results in a genome length that is longer than expected. Unlike other tools which merge homologous regions based on DNA sequence similarity (Redundans, HaploMerger), my method uses genome annotation information to identify pairs of homologous contigs. The source of the genome annotation is flexible but I recommend a conservative approach based on identifying conserved core (single-copy) genes using BUSCO. Contigs that lack BUSCO genes will be missed by this analysis so adding an transcript-based annotation is more sensitive, but also prone to error due to true gene duplications or ambiguous mapping of transcripts.

This approach has been effective at deduplicating primary contigs in the Aedes aegypti FALCON-Unzip assembly, see my poster presented at AGBT 2017.

User Inputs

At the command line, the user must specify the file paths for the following inputs:

BED file containing annotated genes and their gene spans (do not include mRNA, exons etc!). The shell script BUSCOfull2BED will produce this for duplicated genes.
Tab delimited text file with contig IDs and lengths.

Output

Tab delimited text file with the following fields:

contigA ID
contigB ID
contigA length (always the longer contig)
contigB length
number of genes shared for contig pair
comma-delimited list of shared gene IDs

Downstream analysis recommendations

A non-redundant list of contigB IDs is a reasonable starting list of contigs that should be flagged as allelic (haplotype) variants. Simply substracting their total length from the genome assembly length may result in a more accurate estimate of genome size.

NOTE: This approach has been fruitful for PacBio assemblies with contig N50 stats on the order of > 1Mbp. I cannot vouch for its effectiveness on more fragmented NGS-bases assemblies.

Extension (DMRP)

The original HomolContigsByAnnotation, reported a list of contig pairs that contain duplicated BUSCO. My extension carries out a LASTZ alignment between the two contigs and plots a dotplot (using R's ggplot2), highligting the position of identified BUSCO genes. This extension includes two scripts: compareContigPair.pl (requires BioPerl) and compareContigPairPlotAlignment.Rscript.

Usage:

compareContigPair.pl HomolContigsByAnnotation_res.txt contigs.fixLineLength.fasta full_table_curated.BUSCO.tsv

the contigs file (contigs.fasta) must be well formated. Just to make sure you can use EMBOSS's seqret:

seqret -auto contigs.fasta contigs.fixLineLength.fasta

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
BUSCOfull2BED.sh		BUSCOfull2BED.sh
EXAMPLE_BUSCOgenes.bed		EXAMPLE_BUSCOgenes.bed
EXAMPLE_contigLengths.txt		EXAMPLE_contigLengths.txt
FindSharedGenes_withLength.py		FindSharedGenes_withLength.py
Kingan-AGBT-2017-A-high-quality-genome-assembly-of-SMRT-sequences-diploid-mosquito-Aedes-aegypti.pdf		Kingan-AGBT-2017-A-high-quality-genome-assembly-of-SMRT-sequences-diploid-mosquito-Aedes-aegypti.pdf
README.md		README.md
compareContigPair.pl		compareContigPair.pl
compareContigPairPlotAlignment.Rscript		compareContigPairPlotAlignment.Rscript
filterContigs.pl		filterContigs.pl
getContigsWithMissingBUSCOS.pl		getContigsWithMissingBUSCOS.pl
sendContigs2HoplotigPool.sh		sendContigs2HoplotigPool.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Find Homologous Contigs By Gene Annotation

Introduction

User Inputs

Output

Downstream analysis recommendations

Extension (DMRP)

About

Releases

Packages

Languages

diriano/HomolContigsByAnnotation

Folders and files

Latest commit

History

Repository files navigation

Find Homologous Contigs By Gene Annotation

Introduction

User Inputs

Output

Downstream analysis recommendations

Extension (DMRP)

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages