Exact Dedup

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.

Additional parameters

In addition to the common ededup parameters, the Ray implementation provides two more:

  • hash_cpu - specifies the number of CPUs per hash actor
  • num_hashes - specifies the number of hash actors
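
As a rough illustration of how these two parameters combine, the CPU requested by the hash actors is their product. This is a minimal sketch, assuming each hash actor reserves exactly hash_cpu CPUs; the helper name and the example values are hypothetical and not part of the transform's API.

```python
def hash_actor_cpus(hash_cpu: float, num_hashes: int) -> float:
    """Total CPU requested by all hash actors (hypothetical helper)."""
    if hash_cpu <= 0 or num_hashes <= 0:
        raise ValueError("hash_cpu and num_hashes must be positive")
    # Each of the num_hashes actors is assumed to reserve hash_cpu CPUs.
    return hash_cpu * num_hashes


# Example values are illustrative only.
total = hash_actor_cpus(hash_cpu=0.5, num_hashes=4)
print(total)  # 2.0
```

The cluster must have at least this much CPU available before any workers that read and process the data are counted.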

Additional support

We also provide an estimate for roughly determining the cluster size needed to run the transform.

Running

Launched Command Line Options

When running the transform with the Ray launcher (i.e. TransformLauncher), the following command line arguments are available in addition to the options provided by the launcher.

  --ededup_hash_cpu EDEDUP_HASH_CPU
                        number of CPUs per hash
  --ededup_num_hashes EDEDUP_NUM_HASHES
                        number of hash actors to use
  --ededup_doc_column EDEDUP_DOC_COLUMN
                        name of the column containing document
  --ededup_doc_id_column EDEDUP_DOC_ID_COLUMN
                        name of the column containing document id
  --ededup_use_snapshot EDEDUP_USE_SNAPSHOT
                        flag to continue from snapshot
  --ededup_snapshot_directory EDEDUP_SNAPSHOT_DIRECTORY
                        location of snapshot files                      

These correspond to the configuration keys described above.
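
The relationship between configuration keys and command line options can be sketched as follows: each option above is the key name prefixed with `--ededup_`. This is a minimal sketch under that assumption; the helper name, the column names, and the numeric values are hypothetical, and how the resulting arguments are passed to the launcher is not shown here.

```python
import shlex


def build_ededup_args(options: dict) -> list:
    """Turn configuration keys into --ededup_* command line arguments.

    Hypothetical helper: it only prefixes each key with "--ededup_".
    """
    args = []
    for key, value in options.items():
        args.extend([f"--ededup_{key}", str(value)])
    return args


# Illustrative values only; column names depend on your data.
options = {
    "hash_cpu": 0.5,
    "num_hashes": 2,
    "doc_column": "contents",
    "doc_id_column": "document_id",
}
argv = build_ededup_args(options)

# Print the arguments as a single shell-quoted string.
print(shlex.join(argv))
```

Keeping the options in a dictionary makes it easy to reuse the same values for both a programmatic run and a command line invocation.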