Requirements • Query Factory • Dual-Entity Query Dataset • Pathfinding System • License
This repository contains all materials for reproducing the outcomes described in the research paper Further Investigation of Fast Pathfinding in Wikidata. This comprises the following artifacts:
- A Query Factory for deriving a dual-entity query dataset for pathfinding in Wikidata
- The derived dual-entity query dataset
- A Pathfinding System for finding paths between arbitrary entities in Wikidata
The following sections describe each artifact and include instructions for reproducing the results reported in the paper. Because Wikidata is updated continuously, rerunning the optimizer and the benchmark might yield slightly different results. To alleviate this problem, all information retrieved from Wikidata was cached and included in this repository.
Only docker and docker-compose are required to run the programs within this repository. All dependencies are automatically installed using the corresponding Dockerfiles. This ensures reproducibility and ease of use. For guidance on how to install Docker click here.
The purpose of the Query Factory is to derive dual-entity queries for pathfinding in Wikidata from the TREC 2007 Million Queries Track dataset. The GENRE entity linker is used to identify and disambiguate the entities mentioned in the TREC queries.
To run the Query Factory proceed as follows:
- Select the TREC file from which queries should be derived by adjusting the commented parts in query_factory.py.
- From the root directory, run:
    docker compose run query_factory
- In the new bash, start the Query Factory by running:
    factory 07
  Warning: This will overwrite the already present dual-entity query dataset.
The dual-entity query dataset derived using the Query Factory can be found here. It uses the CSV format; the columns have the following meaning:
- wikidata_id_a: The Wikidata ID of the first entity of the query
- wikidata_id_b: The Wikidata ID of the second entity of the query
- trec_id: The ID of the original TREC query
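As an illustration of the column layout above, the following minimal Python sketch parses the dataset into typed records. It assumes the CSV has a header row with exactly these three column names and uses commas as the delimiter. The sample rows are made up for demonstration and are not taken from the dataset, and `DualEntityQuery` and `read_queries` are hypothetical helper names.

```python
import csv
import io
from typing import NamedTuple


class DualEntityQuery(NamedTuple):
    """One row of the dual-entity query dataset."""
    wikidata_id_a: str  # Wikidata ID of the first entity
    wikidata_id_b: str  # Wikidata ID of the second entity
    trec_id: str        # ID of the original TREC query


def read_queries(handle):
    """Parse an open CSV handle into a list of DualEntityQuery records.

    Assumes a header row naming the three columns described above.
    """
    reader = csv.DictReader(handle)
    return [
        DualEntityQuery(row["wikidata_id_a"], row["wikidata_id_b"], row["trec_id"])
        for row in reader
    ]


# Made-up sample rows, for illustration only.
SAMPLE = "wikidata_id_a,wikidata_id_b,trec_id\nQ64,Q183,1\nQ90,Q142,2\n"

queries = read_queries(io.StringIO(SAMPLE))
for query in queries:
    print(query.trec_id, query.wikidata_id_a, query.wikidata_id_b)
```

To read the actual dataset, pass a file opened with `open(path, newline="")` instead of the in-memory sample, where `path` is the location of the CSV file in this repository.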
This artifact comprises three components that together implement the pathfinding. The pathfinder component contains the pathfinding algorithm itself and interacts with two APIs over HTTP: it uses the wikidata_api to issue queries on Wikidata and the wembed_api to calculate semantic distances between entities.
To run the Pathfinding System proceed as follows:
- Launch the Wikidata API in a separate bash:
    docker-compose run --service-ports wikidata_api
- Launch the Wembed API in a separate bash:
    docker-compose run --service-ports wembed_api
- Launch the Pathfinder component in a separate bash:
    docker-compose run pathfinder
  Several commands can be used in this new bash:
  - Launch the pathfinder on a few example queries:
      cargo run -- playground
  - Run the optimizer for fitting the search parameters alpha, beta, and gamma. Warning: This will overwrite the already present optimizer results file.
      cargo run -- optimizer
  - Run the benchmark. Warning: This will overwrite the already present benchmark results files.
      cargo run -- benchmark
  - To activate the debugging logger level, add the debug flag to any of the three commands above. For example, the following runs the pathfinder with verbose logging:
      cargo run -- playground debug
See LICENSE