- DATA: a folder that will contain all the input data (Instructions on how to download the data can be found at: https://github.com/northeastern-datalab/DomainNet-Datasets)
- graph_construction: module to construct a graph representation given a repository of tables
- homograph_injection: module to artificially inject homographs into a repository
- network_analysis: module to run network centrality measures on our graph representation and provide a score for each value in the repository (a conceptual sketch of this representation follows this list).
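For intuition, here is a minimal sketch of the kind of bipartite representation these modules operate on: cell values on one side, (table, column) identifiers on the other, with an edge whenever a value appears in a column. The file layout (`DATA/tables/*.csv`) and node naming are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch (assumed file layout, not the project's actual code):
# build a bipartite graph of cell values and (table, column) identifiers.
import glob

import networkx as nx
import pandas as pd

G = nx.Graph()
for path in glob.glob("DATA/tables/*.csv"):      # hypothetical location of the input tables
    df = pd.read_csv(path)
    for col in df.columns:
        col_node = f"{path}::{col}"              # one node per (table, column) pair
        G.add_node(col_node, bipartite="column")
        for val in df[col].dropna().unique():
            G.add_node(str(val), bipartite="value")
            G.add_edge(str(val), col_node)       # the value appears in this column

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```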
- Clone the repo
- cd into the repo directory, then create and activate a virtual environment for this project
- On macOS or Linux:
python3 -m venv env
source env/bin/activate
which python  # should point inside env/bin/
- Install the necessary packages
pip install -r requirements.txt
We recommend using Python 3.8.
To reproduce our results and analysis on the synthetic benchmark, run the synthetic_benchmark.sh script:
chmod +x synthetic_benchmark.sh && ./synthetic_benchmark.sh
The script produces the bipartite graph representation for the synthetic benchmark and then calculates the betweenness centrality (BC) scores for every node in that graph. Finally, open and run all cells in the synthetic_benchmark_analysis.ipynb Jupyter notebook to see the analysis and the produced figures.
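Conceptually, the scoring step amounts to computing betweenness centrality over the bipartite graph and ranking the value nodes by their score. Below is a self-contained toy sketch using networkx; the graph, file names, and values are made up for illustration and this is not the benchmark script itself.

```python
import networkx as nx

# Toy bipartite graph: value nodes on one side, (table, column) nodes on the other.
# "jaguar" appears in two unrelated columns, mimicking a homograph.
G = nx.Graph()
G.add_edges_from([
    ("jaguar", "animals.csv::species"), ("lion", "animals.csv::species"),
    ("jaguar", "cars.csv::make"),       ("toyota", "cars.csv::make"),
])

# Exact betweenness centrality for every node in the graph.
bc = nx.betweenness_centrality(G, normalized=True)

# Homograph-like values bridge otherwise separate parts of the graph,
# so they tend to receive the highest BC scores.
for node, score in sorted(bc.items(), key=lambda kv: kv[1], reverse=True):
    print(node, round(score, 3))
```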
To reproduce our results and analysis on the table union search (TUS) benchmark, run the TUS_benchmark.sh script:
chmod +x TUS_benchmark.sh && ./TUS_benchmark.sh
The script produces the bipartite graph representation for the TUS benchmark and then calculates approximate BC scores for every node in that graph, using 5000 nodes for sampling. Finally, the precision/recall/F1-score curves at various top-k values are produced and can be found in the network_analysis/figures/TUS/ directory.
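The approximation corresponds to computing shortest paths from a random sample of source nodes rather than from all of them; in networkx this is the `k` parameter of `betweenness_centrality`. The sketch below uses a random stand-in graph purely for illustration and is not the TUS script itself.

```python
import networkx as nx

# Stand-in graph; in the real pipeline this would be the TUS bipartite graph.
G = nx.gnm_random_graph(6000, 18000, seed=1)

# Approximate betweenness centrality: shortest paths are computed only from a
# random sample of k source nodes (the TUS script samples 5000 nodes),
# trading some accuracy for a large speed-up on big graphs.
k = min(5000, G.number_of_nodes())
approx_bc = nx.betweenness_centrality(G, k=k, normalized=True, seed=42)

print("highest approximate BC score:", max(approx_bc.values()))
```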