Code for Thesis: A Non-Parametric Text Classification Approach Utilizing Lossless Compression Models
This codebase is an extension of the code that was provided with the examined paper.
Requires an installation of Conda or Miniconda.
See requirements.txt
Install requirements in a clean environment:
conda create -n npc python=3.7
conda activate npc
pip install -r requirements.txt
By default, only 100 test and training samples per class are used as a quick demo. These numbers can be changed with --num_test and --num_train.
--compressor <gzip, lzw>
--dataset <AG_NEWS, DBpedia, YahooAnswers, 20News, R8, R52, kinnews, kirnews, swahili, filipino, trec, emotion>
--num_train <INT>
--num_test <INT>
--all_test [This will use the whole test dataset.]
--all_train [This will use the whole train dataset.]
--record [This will record the distance matrix so that it can be saved for future use. It's helpful when you want to run on the whole dataset.]
--test_idx_start <INT>
--test_idx_end <INT> [These two args restrict the run to a certain range of the test set. They are also helpful for calculating the distance matrix on the whole dataset.]
--para [This will use multiprocessing to accelerate.]
--output_dir <DIR> [The output directory to save information of tested indices or distance matrix.]
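The --test_idx_start and --test_idx_end options allow a large test set to be processed in several separate runs. The following is a minimal sketch of how such index ranges could be generated. The helper name `index_ranges` is hypothetical, and the half-open convention (start included, end excluded) is an assumption that should be checked against the code before use.

```python
def index_ranges(num_test, chunk_size):
    """Yield (start, end) pairs that cover the indices 0..num_test-1.

    Hypothetical helper; assumes a half-open [start, end) convention.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, num_test, chunk_size):
        # The last chunk is clipped to the size of the test set.
        yield start, min(start + chunk_size, num_test)


# Print the range arguments for a test set of 500 samples, 200 per run.
for start, end in index_ranges(500, 200):
    print(f"--test_idx_start {start} --test_idx_end {end}")
```

Each printed argument pair can be combined with --record and --output_dir so that every run saves its own part of the distance matrix.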
Examples:
--dataset trec --all_test --all_train --para (calculates the accuracy)
--dataset trec --all_test --all_train --record --para --output_dir xxx (saves the calculated NCD)
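For reference, the NCD (normalized compression distance) between two texts x and y is defined as (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The sketch below illustrates this with gzip from the Python standard library. The function names `compressed_len` and `ncd` are hypothetical, and joining the two texts with a single space is an assumption; the implementation in this codebase may differ.

```python
import gzip


def compressed_len(text):
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x, y):
    """Normalized compression distance between two strings.

    Assumption: the texts are concatenated with a single space.
    """
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


a = "the quick brown fox jumps over the lazy dog " * 5
b = "stock markets fell sharply on monday as investors worried about inflation"
# A text is expected to be closer to itself than to an unrelated text.
print(ncd(a, a) < ncd(a, b))
```

A smaller value means that the two texts share more structure that the compressor can exploit.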
To calculate the accuracy from a recorded distance file <DISTANCE DIR>, use
python --record --score --distance_fn <DISTANCE DIR>
Otherwise, the accuracy is calculated automatically by the command in the previous section.
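As an illustration of how an accuracy can be derived from a distance matrix, the following sketch performs a k-nearest-neighbor vote over the rows of the matrix. It assumes one row per test sample and one column per training sample; the function name `knn_accuracy` and this layout are assumptions, and the file format actually written by --record is not covered here.

```python
from collections import Counter


def knn_accuracy(distances, train_labels, test_labels, k=1):
    """Accuracy of a k-NN vote on a (num_test x num_train) distance matrix.

    Hypothetical helper. Ties are resolved in favor of the label whose
    first vote comes from the nearest neighbor.
    """
    correct = 0
    for row, true_label in zip(distances, test_labels):
        # Indices of the k training samples with the smallest distance.
        nearest = sorted(range(len(row)), key=row.__getitem__)[:k]
        votes = Counter(train_labels[i] for i in nearest)
        predicted = votes.most_common(1)[0][0]
        correct += predicted == true_label
    return correct / len(test_labels)


distances = [[0.1, 0.9, 0.8],
             [0.7, 0.2, 0.3]]
train_labels = ["a", "b", "b"]
test_labels = ["a", "b"]
print(knn_accuracy(distances, train_labels, test_labels, k=1))
```

With k=1 each test sample simply receives the label of its closest training sample, so no tie-breaking is needed.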