PMFFRC (Parallel Multi-FastQ-Files Reads Clustering) takes the compression ratio of genomic sequencing reads as its optimization goal. It performs joint clustering compression on the reads in multiple FastQ files by modeling the system memory, the peak memory overhead of the cascading compressor, the number of files, and the number of sequencing reads in the actual application scenario.
First, clone our tool from GitHub:
git clone https://github.com/fahaihi/PMFFRC.git
Second, change to the PMFFRC directory:
cd PMFFRC
Third, run the following commands:
chmod +x install.sh
./install.sh
Finally, configure the environment variables with the following commands:
export PATH=$PATH:`pwd`/
export PMFFRC_PATH="`pwd`/"
source ~/.bashrc
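Note: the two export commands above only affect the current shell session. If you want the settings to persist, a minimal sketch (assuming a bash login shell and that you are still inside the PMFFRC directory) is to append them to ~/.bashrc before sourcing it:
# Hypothetical persistence step; adjust the path if you cloned PMFFRC elsewhere.
echo "export PATH=\$PATH:`pwd`/" >> ~/.bashrc
echo "export PMFFRC_PATH=\"`pwd`/\"" >> ~/.bashrc
source ~/.bashrc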
Warning: PMFFRC relies on the /bin/time command for memory and time evaluation. Before using PMFFRC, make sure that running the following Linux command produces the correct result.
/bin/time -v -p echo "hello pmffrc"
If "/usr/bin/time: No such file or directory" is displayed, make sure you have sudo permission to run the following command:
sudo yum install time
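To confirm that the GNU time binary is available and that peak-memory reporting works, a quick check (a simple sketch, not part of the official installation) is:
# GNU time with -v reports peak memory as "Maximum resident set size (kbytes)".
/usr/bin/time -v echo "hello pmffrc" 2>&1 | grep "Maximum resident set size"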
The PMFFRC algorithm scripts currently only support the HARC (2018), Spring (2019), FastqCLS (2021), and MSTCOM (2022) algorithms. To run ./PMFFRC, please configure the corresponding PMFFRC/src/*_compressor.sh and PMFFRC/src/*_decompressor.sh scripts.
The detailed configuration scripts for the four dedicated compressors are as follows:
First, install and configure the HARC compressor.
cd src
git clone https://github.com/shubhamchandak94/HARC.git
cd HARC
chmod +x install.sh
./install.sh
Next, run the following script to check whether HARC is installed successfully.
./harc -c ${PMFFRC_PATH}data/SRR11995098_test.fastq -p -t 8
Finally, switch to the following directory and check whether the compressed file SRR11995098_test.harc exists.
cd ${PMFFRC_PATH}data
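A quick sanity check (not part of the official workflow) is to compare the original file with the compressed output:
# The .harc file should exist and be noticeably smaller than the original FastQ.
ls -lh SRR11995098_test.fastq SRR11995098_test.harc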
Note: The HARC compressor depends on 7z. If the run shows './harc: line 104: 7z: command not found', make sure you have sudo permission to run the following command:
sudo yum install p7zip p7zip-plugins
Next, install and configure the SPRING compressor.
cd ${PMFFRC_PATH}src
git clone https://github.com/shubhamchandak94/SPRING.git
On Linux with cmake installed and version at least 3.9 (check using cmake --version):
cd SPRING
mkdir build
cd build
cmake ..
make
On Linux with cmake not installed or with version older than 3.12:
cd SPRING
mkdir build
cd build
wget https://cmake.org/files/v3.12/cmake-3.12.4.tar.gz
tar -xzf cmake-3.12.4.tar.gz
cd cmake-3.12.4
./configure
make
cd ..
./cmake-3.12.4/bin/cmake ..
make
Next, run the following script to check whether SPRING is installed successfully.
./spring -c -i ${PMFFRC_PATH}data/SRR11995098_test.fastq -o ${PMFFRC_PATH}data/SRR11995098_test.spring
Finally, switch to the following directory and check whether the compressed file SRR11995098_test.spring exists.
cd ${PMFFRC_PATH}data
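To further verify the SPRING build, you can decompress the archive and compare it with the original; this is a hedged sketch using SPRING's -d/-i/-o options, and the output name SRR11995098_test.dec.fastq is just an example:
cd ${PMFFRC_PATH}src/SPRING/build
./spring -d -i ${PMFFRC_PATH}data/SRR11995098_test.spring -o ${PMFFRC_PATH}data/SRR11995098_test.dec.fastq
# With default (order-preserving) compression, the restored file should match the original exactly.
cmp ${PMFFRC_PATH}data/SRR11995098_test.fastq ${PMFFRC_PATH}data/SRR11995098_test.dec.fastq && echo "SPRING round-trip OK"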
Then, install and configure the MSTCOM compressor.
cd ${PMFFRC_PATH}src
git clone https://github.com/yuansliu/mstcom.git
cd mstcom
make
Next, run the following script to check whether MSTCOM is installed successfully.
./mstcom e -i ${PMFFRC_PATH}data/SRR11995098_test.fastq -o ${PMFFRC_PATH}data/SRR11995098_test.mstcom
Finally, switch to the following directory and check whether the compressed file SRR11995098_test.mstcom exists.
cd ${PMFFRC_PATH}data
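As with the other compressors, a simple size comparison (not part of the official workflow) gives a quick sanity check:
# The .mstcom output should be substantially smaller than the original FastQ.
du -h SRR11995098_test.fastq SRR11995098_test.mstcom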
Notes:
(a) The FastqCLS compressor compresses the entire FastQ file; we modified the FastqCLS script so that it compresses only the reads. FastqCLS is located in src/fastqcls.
(b) FastqCLS relies on the zpaq compressor; if zpaq is not configured on your device, download and install zpaq first.
(c) The FastqCLS compressor relies on Python scripts, so you need to install Python on your system (a quick availability check follows these notes).
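A quick way to check whether these prerequisites are already available (a simple sketch; install any missing tool with your package manager) is:
# Each command should print a path; an empty result means the dependency is missing.
command -v zpaq
command -v python3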
After configuring the PMFFRC/src/*_compressor.sh and PMFFRC/src/*_decompressor.sh files, run PMFFRC with the following command:
Compression-> Compress Multi-FastQ Files:
PMFFRC [-c multi-fastQ-files-path]
[-y cascading algorithm used. "harc"]
[-t num_threads. --Default=20]
[-u user_defined_ram_size. --Default 10GB]
[-q write quality values and read ids to .quality && .id files]
[-e clean temp files. --Default "false"]
DECompression-> DECompress Multi-FastQ Files:
PMFFRC [-d *.pmffrc format file]
[-e clean temp files. --Default "false"]
Help (this message):
PMFFRC -h
Notes: In the PMFFRC toolkit, the parameters beta, x1, and x2 are initialized in the *_compressor.sh scripts. The BIOCONDA version will be updated soon...
We provide a validation dataset under the PMFFRC/data/testData directory, which consists of 12 real FastQ sequencing files, each approximately 100 MB.
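Before running the examples, you can confirm the test data is in place (a simple check, run from the repository root):
# Should list 12 FastQ files of roughly 100 MB each.
ls -lh data/testData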
The following are some examples of compression using PMFFRC cascading different compressors:
cd data
PMFFRC -c testData -y harc -t 2 -u 10 -q -e
Note: If the algorithm fails, the problem is likely with the environment the cascading algorithm depends on; please check the testData/harc_2_10_pmffrc_output/C1.log file for the specific error information.
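For example, a quick way to inspect the end of that log (assuming the run has already produced it) is:
tail -n 20 testData/harc_2_10_pmffrc_output/C1.log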
cd testData
PMFFRC -d harc_2_10_testData.pmffrc -y harc
# The extracted files are located in the following file directories:
# *testData/de_harc_2_10_testData
cd data
PMFFRC -c testData -y spring -t 4 -u 2
cd testData
PMFFRC -d spring_4_2_testData.pmffrc -y spring
# The extracted files are located in the following file directories:
# *testData/spring_4_2_testData
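To check that decompression is lossless, you can compare checksums of the original and restored files; this is a hedged sketch, and the decompressed directory name (taken from the comment above) may differ on your run:
cd ${PMFFRC_PATH}data
# Checksums of the original test files.
md5sum testData/*.fastq
# Checksums of the restored files (adjust the directory name to the one reported for your run).
md5sum testData/spring_4_2_testData/*.fastq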
DEMO URL: https://www.youtube.com/embed/aB1SvhCh7ww?start=1
Our experiment was conducted on the SUGON-7000A supercomputer system at the Nanning Branch of the National Supercomputing Center, using a queue of CPU/GPU heterogeneous computing nodes. The compute nodes used in the experiment were configured as follows:
2*Intel Xeon Gold 6230 CPUs (2.1 GHz, 40 cores in total),
2*NVIDIA Tesla-T4 GPU (16GB CUDA memory, 2560 CUDA cores),
512GB DDR4 memory, and
8*900GB external storage.
We experimentally evaluated PMFFRC using real, publicly available sequencing datasets from the NCBI database: Homo sapiens (human), Cicer arietinum (chickpea), and Salvelinus fontinalis (brook trout).
1. For the Homo sapiens dataset, we randomly selected the following accession numbers: ERR7091240-1243; 1245-1248; 1253-1256; 1258-1269 (24 SE files). Download this dataset with the following command:
bash data/NextSeq-550_Homo_sapiens_SE.sh
2. For the Cicer arietinum dataset, we randomly selected the following accession numbers: SRR13556190-13556217; 220; 224 (60 PE files). Download this dataset with the following command:
bash data/HiSeq-2000_Cicer_arietinum_PE.sh
3. For the Salvelinus fontinalis dataset, we randomly selected the following accession numbers: SRR11994925-SRR11995284 (360 SE files). Download this dataset with the following command:
bash data/Ion-Torrent_Salvelinus_fontinalis_SE.sh
Datasets were downloaded and extracted using SRA-Tools: https://github.com/ncbi/sra-tools.
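If you prefer to fetch individual runs yourself, a minimal sketch using the SRA-Tools prefetch and fasterq-dump commands (assuming SRA-Tools is installed; the accession below is one run from the Salvelinus fontinalis set) is:
# Download the .sra archive for one run and convert it to FastQ.
prefetch SRR11994925
fasterq-dump SRR11994925 -O ./fastq_out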
- Thanks to @HPC-GXU for the computing device support.
- Thanks to @NCBI for all available datasets.
- Thanks to @HARC-Project for HARC source code.
- Thanks to @SPRING-Project for Spring source code.
- Thanks to @MSTCOM-Project for MSTCOM source code.
- Thanks to @FASTQCLS-Project for FastqCLS source code.
Source-Version: V1.2022.10.14. V2.2023.02.26.
Latest-Version: V2.1.2023.04.17.
Authors: NBJL-BioGrop.
Contact us: https://nbjl.nankai.edu.cn OR [email protected]
Cite PMFFRC: Sun, H., Zheng, Y., Xie, H. et al. PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering. BMC Bioinformatics 24, 454 (2023). https://doi.org/10.1186/s12859-023-05566-9