[Feature] Distributed graph store (dmlc#1383)
* initial version from distributed training.

This is copied from multiprocessing training.

* modify for distributed training.

* it's runnable now.

* measure time in neighbor sampling.

* simplify neighbor sampling.

* fix a bug in distributed neighbor sampling.

* allow single-machine training.

* fix a bug.

* fix a bug.

* fix openmp.

* make some improvement.

* fix.

* add prepare in the sampler.

* prepare nodeflow async.

* fix a bug.

* get id.

* simplify the code.

* improve.

* fix partition.py

* fix the example.

* add more features.

* fix the example.

* allow one partition

* use distributed kvstore.

* do g2l map manually.

* fix commandline.

* a temp script to save reddit.

* fix pull_handler.

* add pytorch version.

* estimate the time for copying data.

* delete unused code.

* fix a bug.

* print id.

* fix a bug

* fix a bug

* fix a bug.

* remove redundant code.

* revert modify in sampler.

* fix temp script.

* remove pytorch version.

* fix.

* distributed training with pytorch.

* add distributed graph store.

* fix.

* add metis_partition_assignment.

* fix a few bugs in distributed graph store.

* fix test.

* fix bugs in distributed graph store.

* fix tests.

* remove code of defining DistGraphStore.

* fix partition.

* fix example.

* update run.sh.

* only read necessary node data.

* batching data fetch of multiple NodeFlows.

* simplify gcn.

* remove unnecessary code.

* use the new copy_from_kvstore.

* update training script.

* print time in graphsage.

* make distributed training runnable.

* use val_nid.

* fix train_sampling.

* add distributed training.

* add run.sh

* add more timing.

* fix a bug.

* save graph metadata when partition.

* create ndata and edata in distributed graph store.

* add timing in minibatch training of GraphSage.

* use pytorch distributed.

* add checks.

* fix a bug in global vs. local ids.

* remove fast pull

* fix a compile error.

* update and add new APIs.

* implement more methods in DistGraphStore.

* update more APIs.

* rename it to DistGraph.

* rename to DistTensor

* remove some unnecessary API.

* remove unnecessary files.

* revert changes in sampler.

* Revert "simplify gcn."

This reverts commit 0ed3a34.

* Revert "simplify neighbor sampling."

This reverts commit 551c72d.

* Revert "measure time in neighbor sampling."

This reverts commit 63ae80c.

* Revert "add timing in minibatch training of GraphSage."

This reverts commit e59dc89.

* Revert "fix train_sampling."

This reverts commit ea6aea9.

* fix lint.

* add comments and small update.

* add more comments.

* add more unit tests and fix bugs.

* check the existence of shared-mem graph index.

* use new partitioned graph storage.

* fix bugs.

* print error in fast pull.

* fix lint

* fix a compile error.

* save absolute path after partitioning.

* small fixes in the example

* Revert "[kvstore] support any data type for init_data() (dmlc#1465)"

This reverts commit 87b6997.

* fix a bug.

* disable evaluation.

* Revert "Revert "[kvstore] support any data type for init_data() (dmlc#1465)""

This reverts commit f5b8039.

* support set and init data.

* support set and init data.

* Revert "Revert "[kvstore] support any data type for init_data() (dmlc#1465)""

This reverts commit f5b8039.

* fix bugs.

* fix unit test.

* move to dgl.distributed.

* fix lint.

* fix lint.

* remove local_nids.

* fix lint.

* fix test.

* remove train_dist.

* revert train_sampling.

* rename funcs.

* address comments.

* address comments.

Use NodeDataView/EdgeDataView to keep track of data.

* address comments.

* address comments.

* revert.

* save data with DGL serializer.

* use the right way of getting shape.

* fix lint.

* address comments.

* address comments.

* fix an error in mxnet.

* address comments.

* add edge_map.

* add more test and fix bugs.

Co-authored-by: Zheng <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
7 people authored May 3, 2020
1 parent 5fc334f commit 2190c39
Showing 16 changed files with 1,103 additions and 45 deletions.
2 changes: 0 additions & 2 deletions examples/pytorch/graphsage/train_sampling.py
@@ -78,7 +78,6 @@ def inference(self, g, x, batch_size, device):
Inference with the GraphSAGE model on full neighbors (i.e. without neighbor sampling).
g : the entire graph.
x : the input of entire node set.
The inference code is written in a fashion that it could handle any number of nodes and
layers.
"""
Expand Down Expand Up @@ -114,7 +113,6 @@ def prepare_mp(g):
Explicitly materialize the CSR, CSC and COO representation of the given graph
so that they could be shared via copy-on-write to sampler workers and GPU
trainers.
This is a workaround before full shared memory support on heterogeneous graphs.
"""
g.in_degree(0)
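The `prepare_mp` docstring above describes forcing DGL to build each sparse format once before worker processes fork. A minimal sketch of that pattern, assuming the 0.4-era DGLGraph API (the hunk shows only the first call; the remaining lines illustrate the same idea and may differ from the example's actual body):

```python
def prepare_mp(g):
    """Materialize the CSR, CSC and COO structures of `g` up front so that
    sampler workers and GPU trainers forked later can reuse them via
    copy-on-write instead of rebuilding them in every process."""
    g.in_degree(0)     # an in-degree query forces the in-edge (CSC-like) structure
    g.out_degree(0)    # an out-degree query forces the out-edge (CSR-like) structure
    g.find_edges([0])  # an edge lookup forces the COO structure
```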
11 changes: 9 additions & 2 deletions include/dgl/runtime/shared_mem.h
@@ -58,13 +58,20 @@ class SharedMemory {
* \param size the size of the shared memory.
* \return the address of the shared memory
*/
-void *create_new(size_t size);
+void *CreateNew(size_t size);
/*
* \brief allocate shared memory that has been created.
* \param size the size of the shared memory.
* \return the address of the shared memory
*/
-void *open(size_t size);
+void *Open(size_t size);
+
+/*
+ * \brief check if the shared memory exists.
+ * \param name the name of the shared memory.
+ * \return a boolean value to indicate if the shared memory exists.
+ */
+static bool Exist(const std::string &name);
};
#endif // _WIN32

2 changes: 1 addition & 1 deletion python/dgl/contrib/__init__.py
@@ -1,4 +1,4 @@
from . import sampling
from . import graph_store
from .dis_kvstore import KVClient, KVServer
-from .dis_kvstore import read_ip_config
+from .dis_kvstore import read_ip_config
2 changes: 1 addition & 1 deletion python/dgl/contrib/dis_kvstore.py
@@ -1381,4 +1381,4 @@ def _default_push_handler(self, name, ID, data, target):
self._data_store
"""
target[name][ID] = data
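The `_default_push_handler` shown above simply overwrites rows of the server-side tensor at the pushed IDs. A handler following the same argument pattern could apply a different merge rule; a hypothetical sketch (the name and the accumulate semantics are illustrative, not part of this change):

```python
def sum_push_handler(name, ID, data, target):
    # Accumulate pushed rows into the store instead of overwriting them,
    # e.g. to aggregate partial updates from several trainers.
    target[name][ID] += data
```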


4 changes: 4 additions & 0 deletions python/dgl/distributed/__init__.py
@@ -0,0 +1,4 @@
"""DGL distributed."""

from .dist_graph import DistGraphServer, DistGraph
from .partition import partition_graph, load_partition
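The new `dgl.distributed` package exposes a partition-then-serve workflow: `partition_graph` splits a graph offline (METIS-based assignment, per the commit log), `load_partition` brings one part back on a server, and `DistGraph` gives trainers a graph-like view whose ndata/edata are backed by the distributed KVStore. A rough usage sketch follows; the keyword names, config-file path, and return layout are assumptions for illustration, not a verbatim copy of the API introduced here:

```python
import networkx as nx
import torch
import dgl
from dgl.distributed import partition_graph, load_partition

# Toy input graph with node features (0.4-era DGLGraph construction).
g = dgl.DGLGraph(nx.erdos_renyi_graph(1000, 0.01))
g.ndata['feat'] = torch.randn(g.number_of_nodes(), 16)

# Offline: cut the graph into 4 parts and write them, together with a
# metadata/config file, under 'tmp/toy_graph/'.
partition_graph(g, graph_name='toy_graph', num_parts=4, out_path='tmp/toy_graph')

# On a server: load one partition (its local subgraph plus the node/edge
# data assigned to it). The exact tuple layout is assumed here.
part = load_partition('tmp/toy_graph/toy_graph.json', part_id=0)

# Trainers would then construct a DistGraph (served by DistGraphServer
# processes); its ndata/edata behave like DistTensors spread across
# machines. Constructor arguments are omitted to avoid guessing them.
```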