A code base for heterogeneous information network embedding algorithms.

xemcerk/HIN

Metapath2vec

Dependencies

  • PyTorch 1.0.1+

How to run the code

Run with the following procedures:

1, Run sampler.py on your graph dataset. Note that the input text file should be a list of mappings, so you will probably need to preprocess your graph dataset first. Sample files in the expected format are available under "net_dbis". You can also use your own metapath sampler implementation instead.
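If you write your own metapath sampler, the core idea is a random walk that only steps to neighbors whose node type matches the next type in the metapath. The following is a minimal sketch of that idea, not the code in sampler.py: the function name `metapath_walk`, the dictionary-based graph, the toy node names and the author-paper-venue-paper pattern are all assumptions made for illustration.

```python
import random


def metapath_walk(neighbors, node_type, start, metapath, walk_length, rng):
    """Random walk that only steps to neighbors matching the metapath.

    neighbors: dict mapping node -> list of adjacent nodes
    node_type: dict mapping node -> type label, e.g. "A", "P" or "V"
    metapath:  cyclic type pattern, e.g. ["A", "P", "V", "P"]
    """
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the first metapath type")
    walk = [start]
    while len(walk) < walk_length:
        # Type required at the next position; the pattern repeats cyclically.
        wanted = metapath[len(walk) % len(metapath)]
        candidates = [n for n in neighbors[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk


# Hypothetical toy graph: two authors, two papers, one venue.
neighbors = {
    "a1": ["p1"],
    "a2": ["p2"],
    "p1": ["a1", "v1"],
    "p2": ["a2", "v1"],
    "v1": ["p1", "p2"],
}
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "v1": "V"}

walk = metapath_walk(neighbors, node_type, "a1", ["A", "P", "V", "P"], 9,
                     random.Random(0))
print(" ".join(walk))
```

Whatever sampler you use, its output still has to be written in the format that metapath2vec.py reads.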

2, Run the following command:

python metapath2vec.py --download "where/you/want/to/download" --output_file "your_output_file_path"

Tips: Adjust num_workers to match your GPU instances. Running 3 or 4 epochs is usually enough.

Tricks included in the implementation:

1, Sub-sampling;

2, Negative Sampling without repeatedly calling numpy random choices;
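Both tricks can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: it uses the standard library `random` module instead of numpy, the sub-sampling rule is the word2vec formula (keep a node with probability min(1, sqrt(t / f))), and the names `keep_probability`, `NegativeTable`, the threshold `t` and the 0.75 exponent are assumed. The negative table is built and shuffled once, and each request then reads the next consecutive slice instead of drawing fresh random numbers.

```python
import math
import random


def keep_probability(count, total, t=1e-4):
    """word2vec-style sub-sampling: keep with probability min(1, sqrt(t / f))."""
    f = count / total  # relative frequency of the node in the walks
    return min(1.0, math.sqrt(t / f))


class NegativeTable:
    """Pre-shuffled table of node ids; negatives are read off in slices."""

    def __init__(self, counts, table_size, rng):
        weights = {node: c ** 0.75 for node, c in counts.items()}
        norm = sum(weights.values())
        self.table = []
        for node, w in weights.items():
            # Each node fills a share of the table proportional to
            # count ** 0.75 (at least one slot, so no node is left out).
            self.table.extend([node] * max(1, round(w / norm * table_size)))
        rng.shuffle(self.table)  # shuffle once, up front
        self.pos = 0

    def sample(self, size):
        """Return `size` negatives; assumes size <= len(self.table)."""
        out = self.table[self.pos:self.pos + size]
        self.pos = (self.pos + size) % len(self.table)
        if len(out) < size:
            # Wrapped past the end: top up from the start of the table.
            out += self.table[:self.pos]
        return out


counts = {"a": 100, "b": 10, "c": 1}  # hypothetical node frequencies
table = NegativeTable(counts, table_size=1000, rng=random.Random(0))
print(table.sample(5))
```

The slice-based lookup trades a little randomness (the table order repeats after one full pass) for much cheaper sampling per training batch.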

Performance and Explanations:

Venue Classification Results for Metapath2vec:

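| Metric | 5% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Macro-F1 | 0.3033 | 0.5247 | 0.8033 | 0.8971 | 0.9406 | 0.9532 | 0.9529 | 0.9701 | 0.9683 | 0.9670 |
| Micro-F1 | 0.4173 | 0.5975 | 0.8327 | 0.9011 | 0.9400 | 0.9522 | 0.9537 | 0.9725 | 0.9815 | 0.9857 |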

Author Classification Results for Metapath2vec:

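| Metric | 5% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Macro-F1 | 0.9216 | 0.9262 | 0.9292 | 0.9303 | 0.9309 | 0.9314 | 0.9315 | 0.9316 | 0.9319 | 0.9320 |
| Micro-F1 | 0.9279 | 0.9319 | 0.9346 | 0.9356 | 0.9361 | 0.9365 | 0.9365 | 0.9365 | 0.9367 | 0.9369 |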

Note that:

Testing files are available in "label 2" file;

The results above are those listed in the paper; in real experiments, the exact numbers may differ slightly:

1, For venue node classification results, when the size of the training dataset is small (e.g. 5%), the variance of the performance is large since the number of available labeled venues is small.

2, For author node classification results, the performance is stable since the number of available labeled authors is huge, so even 5% training data would be sufficient.

3, In test.py, you can change how many times each experiment is repeated. Testing author classification is very slow, so you may want to run it only 1 or 2 times.
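For reference, the two metrics in the tables can be computed as in the sketch below. It is a plain-Python illustration of the standard definitions for single-label classification, not the code in test.py; the function name `macro_micro_f1` and the toy labels are assumptions. Macro-F1 averages the per-class F1 scores, so one poorly predicted rare class pulls it down sharply, while Micro-F1 pools all decisions before computing F1.

```python
from collections import Counter


def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: unweighted mean of per-class F1 scores.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: F1 over the pooled counts of all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy example: the rare class 1 is never predicted.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"Macro-F1 = {macro:.4f}, Micro-F1 = {micro:.4f}")
```

In this toy case Macro-F1 is 4/9 and Micro-F1 is 0.8, which illustrates how Macro-F1 can fall well below Micro-F1 when some classes are missed entirely.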
