Dataset Similarity Ranking

This project was carried out as a Capita Selecta project at the Eindhoven University of Technology. Given an input dataset, it ranks all datasets in the OpenML CC-18 classification suite by their similarity to that dataset. Two rankings are produced: one uses the Dataset2Vec meta-feature extractor described in the paper "Dataset2Vec: Learning Dataset Meta-Features", and the other uses the PyMFE meta-features.

In addition to producing the two similarity rankings, the project can evaluate them on the meta-learning task of model selection.

Usage

To produce a similarity ranking, run the rank_data_set_similarity.py file with an OpenML dataset ID as input. For example:

python rank_data_set_similarity.py --input_dataset 14
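
To illustrate what a similarity ranking over meta-features can look like, here is a minimal sketch. It assumes each dataset is represented by a fixed-length meta-feature vector and that similarity is measured with cosine similarity; the README does not state which measure the script actually uses, so treat that choice, the function names, and the toy vectors as hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_id, meta_features):
    """Return (dataset_id, similarity) pairs, most similar first.

    The query dataset itself is excluded from the ranking.
    """
    query = meta_features[query_id]
    scores = [
        (dataset_id, cosine_similarity(query, vector))
        for dataset_id, vector in meta_features.items()
        if dataset_id != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical meta-feature vectors keyed by dataset id (values are made up).
toy_meta_features = {
    14: [1.0, 0.0, 2.0],
    16: [2.0, 0.0, 4.0],  # points in the same direction as dataset 14
    22: [0.0, 3.0, 0.0],  # orthogonal to dataset 14
    31: [1.0, 1.0, 1.0],
}

ranking = rank_by_similarity(14, toy_meta_features)
print([dataset_id for dataset_id, _ in ranking])
```

In practice the vectors would come from the extracted Dataset2Vec or PyMFE meta-features rather than from a hand-written dictionary.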

The meta-features can be re-extracted by running:

python extract_meta_features.py

This writes two CSV files to the /extracted_MF folder.
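
For readers new to meta-features, the sketch below computes a few very simple ones by hand. It is only an illustration of the idea: it does not use the PyMFE or Dataset2Vec APIs, and the function name and the toy data are hypothetical. The extractors used in this project produce much richer descriptions of a dataset.

```python
import math
from collections import Counter


def simple_meta_features(X, y):
    """Compute a handful of basic meta-features for a labelled dataset.

    X is a list of feature rows and y is the list of class labels.
    """
    n_instances = len(X)
    n_features = len(X[0]) if X else 0
    class_counts = Counter(y)
    # Shannon entropy (in bits) of the class distribution.
    class_entropy = -sum(
        (count / n_instances) * math.log2(count / n_instances)
        for count in class_counts.values()
    )
    return {
        "n_instances": n_instances,
        "n_features": n_features,
        "n_classes": len(class_counts),
        "class_entropy": class_entropy,
    }


# Made-up toy dataset: four instances, two features, two balanced classes.
X = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.0]]
y = ["a", "a", "b", "b"]

mf = simple_meta_features(X, y)
print(mf)
```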

The performance of the two similarity rankings, based on the previously extracted meta-features, can be evaluated by running:

python evaluate_similarity.py

This writes a CSV file to the /similarity_evaluation folder. Note: the evaluation takes a long time to run.
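
One way to evaluate a similarity ranking through model selection can be sketched as follows: pick the model that performs best on the k most similar datasets, then measure how far it falls short of the best model on the target dataset (the regret). This is an assumed protocol for illustration, not necessarily what evaluate_similarity.py does; the function names, model names, and accuracy values are all hypothetical.

```python
def select_model(neighbors, performance):
    """Pick the model with the highest mean score over the neighbor datasets."""
    models = performance[neighbors[0]].keys()
    return max(
        models,
        key=lambda m: sum(performance[d][m] for d in neighbors) / len(neighbors),
    )


def regret(target, ranking, performance, k=2):
    """Return (chosen_model, regret) for a target dataset.

    ranking lists dataset ids from most to least similar to the target.
    Regret is the gap between the best score on the target and the score
    of the model chosen from the top-k neighbors.
    """
    chosen = select_model(ranking[:k], performance)
    best_score = max(performance[target].values())
    return chosen, best_score - performance[target][chosen]


# Made-up accuracies per dataset id and model.
performance = {
    14: {"rf": 0.80, "svm": 0.85, "knn": 0.75},
    16: {"rf": 0.78, "svm": 0.88, "knn": 0.70},
    22: {"rf": 0.90, "svm": 0.60, "knn": 0.65},
    31: {"rf": 0.76, "svm": 0.74, "knn": 0.72},
}

# Two hypothetical rankings for target dataset 14.
good_choice, good_regret = regret(14, [16, 31, 22], performance)
poor_choice, poor_regret = regret(14, [22, 31, 16], performance)
print(good_choice, good_regret)
print(poor_choice, poor_regret)
```

Under this protocol, a ranking that places truly similar datasets first should yield lower regret on average across targets.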


Repository: jmniederle/dataset_similarity
