Transposable elements (TEs), selfish DNA sequences that can move within the genome, comprise a large proportion of the genomes of many organisms. Although low-coverage whole genome sequencing can be used to survey TE composition, it is uneconomical for species with large genomes. Here, we use restriction-site associated DNA sequencing (RADseq) as an alternative method to survey TE composition.
In our paper (Chak et al., 2019), we demonstrate in silico that double digest restriction-site associated DNA sequencing (ddRADseq) markers contain the same TE compositions as whole genome assemblies across arthropods. We then show empirically, using eight Synalpheus snapping shrimp species with large genomes, that TE compositions from ddRADseq and low-coverage whole genome sequencing are comparable within and across species.
This bioinformatic pipeline, TERAD, is used to extract TE compositions from RADseq data.
IMPORTANT: Our pipeline used only one end of the paired-end reads to remove the bias caused by the rarity of EcoRI cut sites among known arthropod TEs. We found that the cut frequency of EcoRI was lower than that of MspI (56% vs. 85%, respectively). Therefore, we analyzed only the EcoRI ends of the paired-end reads and included only TEs that did not have an EcoRI restriction site. For your organism, you should decide which of your enzymes is the rare cutter and use that end of the paired-end reads for the analysis.
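One rough way to compare two enzymes is to count how many sequences in a TE library contain each recognition site at least once. The function below is a minimal sketch of that idea, not the procedure used in the paper: `site_fraction` is a hypothetical helper name, and "fraction of library sequences with at least one site" is only an assumed stand-in for cut frequency. The recognition sites are GAATTC for EcoRI and CCGG for MspI; both are palindromic, so the reverse strand does not need a separate search.

```shell
# Fraction of FASTA records that contain a recognition site at least once.
# Usage: site_fraction <fasta> <site>
# Hypothetical helper; sequence lines are joined first, so a site that
# spans a line break in the FASTA file is still found.
site_fraction() {
  awk -v site="$2" '
    function tally() {
      # Count the record just finished and test it for the site.
      if (have) { n++; if (index(toupper(seq), toupper(site)) > 0) hit++ }
    }
    /^>/ { tally(); seq = ""; have = 1; next }   # header line starts a new record
    { seq = seq $0 }                             # accumulate sequence lines
    END {
      tally()
      if (n == 0) { print "0.000"; exit }
      printf "%.3f\n", hit / n
    }
  ' "$1"
}

# Example (file name taken from the test run in this README):
# site_fraction arthro_ES_ND_PV_classified.fa GAATTC   # EcoRI
# site_fraction arthro_ES_ND_PV_classified.fa CCGG     # MspI
```

The enzyme with the lower fraction would be the candidate rare cutter under this assumption.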
Place contents in, for example, ~/Desktop/TERAD
Install RepeatMasker (http://www.repeatmasker.org/RMDownload.html) and the programs it depends on, including a sequence search engine such as RMBlast.
Then run the RepeatMasker configuration script.
Our program has been tested using RepeatMasker, v 1.332 2017/04/17, with RMBlast 2.9.0-p1.
Install CD-HIT from https://github.com/weizhongli/cdhit
Our program has been tested using CD-HIT version 4.7 (built on Jan 25 2018).
Once R is installed, start R in a terminal by typing `R`, then type the following in the R console:
install.packages("readr"); install.packages("plyr"); install.packages("fitdistrplus")
If you are installing on an HPC system, you may need to choose a custom library location (for an example, see https://www.osc.edu/resources/getting_started/howto/howto_install_local_r_packages).
To edit the bash_profile: nano $HOME/.bash_profile
Add these lines at the end, adjusting the paths to the locations where you installed these programs:
export PATH=$PATH:/Users/solomon/Documents/programs/cdhit-master
export PATH=$PATH:/Users/solomon/Documents/programs/RepeatMasker
To save: press Control+O, then Enter to confirm the file name, then Control+X to exit.
nano .bashrc
Add these lines:
PATH=/XXXX/XXXX/RepeatMasker:$PATH
PATH=/XXXX/XXXX/cdhit:$PATH
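After editing the PATH, open a new terminal (or `source` the edited file) and confirm that the programs can be found. The function below is a small sketch for that check; `check_tools` is a hypothetical helper, and the program names in the example are assumptions about what your RepeatMasker and CD-HIT installations provide.

```shell
# Report which of the named programs cannot be found on PATH.
# Prints one "missing: NAME" line per absent program and
# returns 1 if any program is missing, 0 otherwise.
check_tools() {
  ct_status=0
  for ct_tool in "$@"; do
    if ! command -v "$ct_tool" >/dev/null 2>&1; then
      echo "missing: $ct_tool"
      ct_status=1
    fi
  done
  return "$ct_status"
}

# Example (program names are assumptions; adjust them to your installation):
# check_tools RepeatMasker RepeatProteinMask cd-hit-est R
```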
Test run:
(You may need to do: chmod +x TERAD)
cd ~/Desktop/TERAD
./TERAD test_file.fasta 4 ./arthro_ES_ND_PV_classified.fa none
Inputs are:
1. A file to search for TEs, in FASTA format
2. The number of cores to use
3. The custom library to search for TEs, or "none"
4. Query species for RepeatMasker, or "none"

Note: use either input 3 or input 4 and leave the other as "none".
Make sure your input file, TERAD, extract_cdhit2.R, and RAD_TE_summary.R are in the same folder, along with your custom TE library in FASTA format if you wish to use one.
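The file layout described above can be checked before starting a long run. This is a sketch only: `terad_preflight` is a hypothetical helper that is not part of TERAD, and it assumes the library is given as a file name inside the same folder.

```shell
# Check that the files TERAD needs are present in one folder.
# Usage: terad_preflight <folder> <input.fasta> [library.fa|none]
# Prints one "not found: PATH" line per missing file; returns 1 if any is missing.
terad_preflight() {
  tp_dir=$1
  tp_input=$2
  tp_library=${3:-none}
  tp_status=0
  for tp_file in TERAD extract_cdhit2.R RAD_TE_summary.R "$tp_input"; do
    if [ ! -f "$tp_dir/$tp_file" ]; then
      echo "not found: $tp_dir/$tp_file"
      tp_status=1
    fi
  done
  # The custom library is only required when it is not "none".
  if [ "$tp_library" != "none" ] && [ ! -f "$tp_dir/$tp_library" ]; then
    echo "not found: $tp_dir/$tp_library"
    tp_status=1
  fi
  # Remind the user about the executable bit mentioned in the test run.
  if [ -f "$tp_dir/TERAD" ] && [ ! -x "$tp_dir/TERAD" ]; then
    echo "note: TERAD is not executable (try: chmod +x TERAD)"
  fi
  return "$tp_status"
}

# Example:
# terad_preflight ~/Desktop/TERAD test_file.fasta arthro_ES_ND_PV_classified.fa
```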
Test run:
./TERAD test_file.fasta 4 ./arthro_ES_ND_PV_classified.fa none
or
./TERAD test_file.fasta 4 none arthropods
The main output is `test_file.fasta.cd.RAD_TE.summary2`. It is a .csv file that summarizes the proportions of RAD tags for each major TE subclass. The suffixes `.int` and `.ext` indicate whether the restriction enzyme cut sites (e.g., EcoRI if we used the EcoRI ends of the paired-end ddRAD reads) were internal or external to the TE. We only analyzed TEs that don't have EcoRI sites (i.e., the `.ext` proportions) to avoid the bias from the low cut frequency of EcoRI in known arthropod TEs.
Column names | Notes |
---|---|
sample | Sample name |
lib.depth | Total number of RAD tags |
DNA | DNA transposons |
RC | Helitron |
LTR | LTR retrotransposon |
LINE | Long interspersed nuclear elements |
SINE | Short interspersed nuclear elements |
RNA | RNA |
UO | Unknown/Other |
Sim | Simple repeat (Microsatellites) |
Sat | Satellites |
LC | Low complexity repeats |
TE | Total TE |
TE.int | Total TE where restriction sites are internal to the TE (inside the TE) |
TE.ext | Total TE where restriction sites are external to the TE (outside the TE) |
NoTE | RAD tags that have no TEs |
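To pull a few of these columns out of the summary file from the command line, a sketch like the following may help. It assumes the file is comma-separated with a header row that uses the column names listed above (possibly quoted); `csv_columns` is a hypothetical helper, not part of TERAD, and it does not handle commas inside quoted fields.

```shell
# Print selected columns of a comma-separated file, chosen by header name.
# Usage: csv_columns <file.csv> <column> [<column> ...]
csv_columns() {
  cc_file=$1
  shift
  # Join the requested column names with commas to pass them to awk.
  cc_cols=$(IFS=,; echo "$*")
  awk -F, -v want="$cc_cols" '
    NR == 1 {
      n = split(want, names, ",")
      # Map each header name (with any double quotes removed) to its position.
      for (i = 1; i <= NF; i++) { h = $i; gsub(/"/, "", h); pos[h] = i }
      for (j = 1; j <= n; j++)
        if (!(names[j] in pos)) {
          print "no such column: " names[j] > "/dev/stderr"
          exit 1
        }
    }
    {
      line = ""
      for (j = 1; j <= n; j++) line = line (j > 1 ? "," : "") $(pos[names[j]])
      print line
    }
  ' "$cc_file"
}

# Example: sample name, library depth, and the external-site TE proportion
# csv_columns test_file.fasta.cd.RAD_TE.summary2 sample lib.depth TE.ext
```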
From cd-hit
- test_file.fasta.cd
- test_file.fasta.cd.clstr
From extract_cdhit2.R
- test_file.fasta.cd.summary2
- test_file.fasta.cd.summary1
From RepeatMasker
- test_file.fasta.cd.cat
- test_file.fasta.cd.out
- test_file.fasta.cd.tbl
- test_file.fasta.cd.masked
- test_file.fasta.cd.RM.progress
From RepeatProteinMask
- test_file.fasta.cd.masked.protM.progress
- test_file.fasta.cd.masked.annot
- test_file.fasta.cd.masked.masked
- test_file.fasta.cd.masked.rmsimple.cat.all
From RAD_TE_summary.R
- test_file.fasta.cd.RAD_TE.summary1
- test_file.fasta.cd.RAD_TE.summary2
Email [email protected]