Skip to content

NadavElias/DedupSearch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 

Repository files navigation

DedupSearch

General Information

DedupSearch is a string search algrithm for deduplicated storage systems.

Implementation is based on the open-source deduplication system, Destor, publicly available in C: "Design Tradeoffs for Data Deduplication Performance in Backup Workloads" (FAST 15).

We modified a publicly available implementation of the Aho Corasick algorithm in C++ to improve its data structures, memory locality, and suffix matching, and to support non-ASCII strings. For best integration of this implementation into Destor, we refactored the Destor code to use C++ instead of C.

Environment

Ubuntu 16.04

Dependencies

  1. libssl-dev is required to calculate sha-1 digest;

  2. GLib 2.32 or later version

  3. Berkeley Database

    libglib.so and glib.h may not be found when you first install it.

    The header files (that originally locate in /usr/local/include/glib-2.0 and /usr/local/lib/glib-2.0/include) are required to be moved to a searchable path, such as "/usr/local/include".

    Also a link named libglib.so should be created in "/usr/local/lib".

    Install Berkeley DB by executing "sudo apt-get install libdb-dev"

  4. Makefile is automatically generated by GNU autoconf and automake.

INSTALL

If all dependencies are installed, compiling destor with 2D-search is straightforward:

./configure CC="g++-7 -fpermissive --std=c++11 -Ofast"

make

make install

To uninstall destor, run

make uninstall

Running

If compile and install are successful, the executable file, destor, should have been moved to /usr/local/bin by default. You can create a config file, destor.config, in where you run destor. A sample destor.config is in the project directory.

NOTE: run rebuild script before destor backup task to prepare working directory and clear data.

destor can run as follows:

  1. start a backup task,

    destor /path/to/data -p"a line as in config file"

  2. start a restore task,

    destor -r /path/to/restore -p"a line as in config file"

  3. show the current statistics of system,

    destor -s

  4. show help

    destor -h

  5. make a trace

    destor -t /path/to/data

  6. Parameter

    -p"a line in config file"

  7. Naive search

    destor -n<JOB_ID> keyword1 keyword2 ... Use at end of command only!

  8. DedupSearch search

    destor -q<JOB_ID> keyword1 keyword2 ... Use at end of command only!

  9. Tiny directory for DedupSearch search

    -d"path to tiny's directory"

  10. Keywords input file for DedupSearch search where each keyword is XX chars long (for binary keywords)

    -f"path to keywords file" -lXX

Searching Examples

Search "hello" and "bye" with naive search

destor -n0 hello bye

Search "hello" with DedupSearch without 'Tiny' structure

destor -q0 hello

Search "hello" with DedupSearch with 'Tiny' structure, wich will be temporary saved at "./tiny_dir/tiny.db"

destor -d"./tiny_dir" -q0 hello

Search the keywords from the file "keywords.txt", where each keywords is 32 chars long, with DedupSearch without 'Tiny' structure

destor -f"keywords.txt" -q0 -l32

Configuration

A sample configuration is shown in destor.conf

Paper

To read more about DedupSearch please check our paper: "DedupSearch: Two-Phase Deduplication Aware Keyword Search". In 20th USENIX Conference on File and Storage Technologies (FAST 22), 2022.

Autors: Nadav Elias, Gala Yadgar (Technion - Israel Institute of Technology); Philip Shilane (Dell Technologies); Sarai Sheinvald (ORT Braude College of Engineering).

Email : nadav[dot]elias[at]technion[dot]ac[dot]il

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published