DedupSearch is a string search algrithm for deduplicated storage systems.
Implementation is based on the open-source deduplication system, Destor, publicly available in C: "Design Tradeoffs for Data Deduplication Performance in Backup Workloads" (FAST 15).
We modified a publicly available implementation of the Aho Corasick algorithm in C++ to improve its data structures, memory locality, and suffix matching, and to support non-ASCII strings. For best integration of this implementation into Destor, we refactored the Destor code to use C++ instead of C.
Ubuntu 16.04
-
libssl-dev is required to calculate sha-1 digest;
-
GLib 2.32 or later version
-
Berkeley Database
libglib.so and glib.h may not be found when you first install it.
The header files (that originally locate in /usr/local/include/glib-2.0 and /usr/local/lib/glib-2.0/include) are required to be moved to a searchable path, such as "/usr/local/include".
Also a link named libglib.so should be created in "/usr/local/lib".
Install Berkeley DB by executing "sudo apt-get install libdb-dev"
-
Makefile is automatically generated by GNU autoconf and automake.
If all dependencies are installed, compiling destor with 2D-search is straightforward:
./configure CC="g++-7 -fpermissive --std=c++11 -Ofast"
make
make install
To uninstall destor, run
make uninstall
If compile and install are successful, the executable file, destor, should have been moved to /usr/local/bin by default. You can create a config file, destor.config, in where you run destor. A sample destor.config is in the project directory.
NOTE: run rebuild script before destor backup task to prepare working directory and clear data.
destor can run as follows:
-
start a backup task,
destor /path/to/data -p"a line as in config file"
-
start a restore task,
destor -r /path/to/restore -p"a line as in config file"
-
show the current statistics of system,
destor -s
-
show help
destor -h
-
make a trace
destor -t /path/to/data
-
Parameter
-p"a line in config file"
-
Naive search
destor -n<JOB_ID> keyword1 keyword2 ... Use at end of command only!
-
DedupSearch search
destor -q<JOB_ID> keyword1 keyword2 ... Use at end of command only!
-
Tiny directory for DedupSearch search
-d"path to tiny's directory"
-
Keywords input file for DedupSearch search where each keyword is XX chars long (for binary keywords)
-f"path to keywords file" -lXX
Search "hello" and "bye" with naive search
destor -n0 hello bye
Search "hello" with DedupSearch without 'Tiny' structure
destor -q0 hello
Search "hello" with DedupSearch with 'Tiny' structure, wich will be temporary saved at "./tiny_dir/tiny.db"
destor -d"./tiny_dir" -q0 hello
Search the keywords from the file "keywords.txt", where each keywords is 32 chars long, with DedupSearch without 'Tiny' structure
destor -f"keywords.txt" -q0 -l32
A sample configuration is shown in destor.conf
To read more about DedupSearch please check our paper: "DedupSearch: Two-Phase Deduplication Aware Keyword Search". In 20th USENIX Conference on File and Storage Technologies (FAST 22), 2022.
Autors: Nadav Elias, Gala Yadgar (Technion - Israel Institute of Technology); Philip Shilane (Dell Technologies); Sarai Sheinvald (ORT Braude College of Engineering).
Email : nadav[dot]elias[at]technion[dot]ac[dot]il