Code for Thesis: A Non-Parametric Text Classification Approach Utilizing Lossless Compression Models
This codebase is an extension of https://github.com/bazingagin/npc_gzip, the implementation provided with the examined paper.
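For context, the approach computes the Normalized Compression Distance (NCD) between texts with a lossless compressor and then labels each test sample by a k-nearest-neighbour vote over the training set, as in npc_gzip. The snippet below is a minimal sketch of the NCD itself using gzip from the Python standard library; the function and variable names are illustrative only and are not identifiers from this repository.

import gzip

def ncd(x: str, y: str) -> float:
    # Normalized Compression Distance:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    # where C(.) is the compressed length under a lossless compressor (gzip here).
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

Classification then amounts to computing this distance between a test sample and every training sample and taking a majority vote over the k nearest neighbours.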
Install Conda or Miniconda. For the full list of required packages, see requirements.txt.
Install requirements in a clean environment:
conda create -n npc python=3.7
conda activate npc
pip install -r requirements.txt
python main_text.py
By default, this uses only 100 test and 100 training samples per class as a quick demo. These numbers can be changed with --num_test and --num_train.
--compressor <gzip, lzw>
--dataset <AG_NEWS, DBpedia, YahooAnswers, 20News, R8, R52, kinnews, kirnews, swahili, filipino, trec, emotion>
--num_train <INT>
--num_test <INT>
--all_test [This will use the whole test dataset.]
--all_train [This will use the whole train dataset.]
--record [This will save the distance matrix for future use. Helpful when you want to run on the whole dataset.]
--test_idx_start <INT>
--test_idx_end <INT> [These two arguments restrict the run to a given range of the test set, which is also helpful for computing the distance matrix of the whole dataset in chunks.]
--para [This will use multiprocessing to speed up the computation.]
--output_dir <DIR> [The output directory for saving the tested indices and the distance matrix.]
Examples:
python main_text.py --dataset trec --all_test --all_train --para (to compute accuracy)
python main_text.py --dataset trec --all_test --all_train --record --para --output_dir xxx (to save the computed NCD distances)
To calculate accuracy from a recorded distance file <DISTANCE DIR>, run
python main_text.py --record --score --distance_fn <DISTANCE DIR>
Otherwise, the accuracy is calculated automatically by the commands shown in the previous section.
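For intuition, the sketch below shows how accuracy could be derived from a recorded distance matrix with a k-nearest-neighbour vote. It assumes, purely for illustration, that the distances and labels are available as NumPy arrays of shape (num_test, num_train) and (num_train,) / (num_test,); the file names, the value of k, and the exact on-disk format written by --record are placeholders and may differ from what main_text.py actually produces.

import numpy as np
from collections import Counter

distances = np.load("ncd_matrix.npy")        # hypothetical file: NCD between each test and training sample
train_labels = np.load("train_labels.npy")   # hypothetical file
test_labels = np.load("test_labels.npy")     # hypothetical file

k = 2  # neighbourhood size for the kNN vote (chosen arbitrarily for this sketch)
correct = 0
for i in range(distances.shape[0]):
    nearest = np.argsort(distances[i])[:k]   # indices of the k closest training samples
    vote = Counter(train_labels[j] for j in nearest).most_common(1)[0][0]
    correct += int(vote == test_labels[i])
print("accuracy:", correct / distances.shape[0])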