[Feature] Distributed graph store (dmlc#1383)
* initial version from distributed training.

This is copied from multiprocessing training.

* modify for distributed training.

* it's runnable now.

* measure time in neighbor sampling.

* simplify neighbor sampling.

* fix a bug in distributed neighbor sampling.

* allow single-machine training.

* fix a bug.

* fix a bug.

* fix openmp.

* make some improvement.

* fix.

* add prepare in the sampler.

* prepare nodeflow async.

* fix a bug.

* get id.

* simplify the code.

* improve.

* fix partition.py

* fix the example.

* add more features.

* fix the example.

* allow one partition

* use distributed kvstore.

* do g2l map manually.

* fix commandline.

* a temp script to save reddit.

* fix pull_handler.

* add pytorch version.

* estimate the time for copying data.

* delete unused code.

* fix a bug.

* print id.

* fix a bug

* fix a bug

* fix a bug.

* remove redundant code.

* revert modify in sampler.

* fix temp script.

* remove pytorch version.

* fix.

* distributed training with pytorch.

* add distributed graph store.

* fix.

* add metis_partition_assignment.

* fix a few bugs in distributed graph store.

* fix test.

* fix bugs in distributed graph store.

* fix tests.

* remove code of defining DistGraphStore.

* fix partition.

* fix example.

* update run.sh.

* only read necessary node data.

* batching data fetch of multiple NodeFlows.

* simplify gcn.

* remove unnecessary code.

* use the new copy_from_kvstore.

* update training script.

* print time in graphsage.

* make distributed training runnable.

* use val_nid.

* fix train_sampling.

* add distributed training.

* add run.sh

* add more timing.

* fix a bug.

* save graph metadata when partition.

* create ndata and edata in distributed graph store.

* add timing in minibatch training of GraphSage.

* use pytorch distributed.

* add checks.

* fix a bug in global vs. local ids.

* remove fast pull

* fix a compile error.

* update and add new APIs.

* implement more methods in DistGraphStore.

* update more APIs.

* rename it to DistGraph.

* rename to DistTensor

* remove some unnecessary API.

* remove unnecessary files.

* revert changes in sampler.

* Revert "simplify gcn."

This reverts commit 0ed3a34.

* Revert "simplify neighbor sampling."

This reverts commit 551c72d.

* Revert "measure time in neighbor sampling."

This reverts commit 63ae80c.

* Revert "add timing in minibatch training of GraphSage."

This reverts commit e59dc89.

* Revert "fix train_sampling."

This reverts commit ea6aea9.

* fix lint.

* add comments and small update.

* add more comments.

* add more unit tests and fix bugs.

* check the existence of shared-mem graph index.

* use new partitioned graph storage.

* fix bugs.

* print error in fast pull.

* fix lint

* fix a compile error.

* save absolute path after partitioning.

* small fixes in the example

* Revert "[kvstore] support any data type for init_data() (dmlc#1465)"

This reverts commit 87b6997.

* fix a bug.

* disable evaluation.

* Revert "Revert "[kvstore] support any data type for init_data() (dmlc#1465)""

This reverts commit f5b8039.

* support set and init data.

* support set and init data.

* Revert "Revert "[kvstore] support any data type for init_data() (dmlc#1465)""

This reverts commit f5b8039.

* fix bugs.

* fix unit test.

* move to dgl.distributed.

* fix lint.

* fix lint.

* remove local_nids.

* fix lint.

* fix test.

* remove train_dist.

* revert train_sampling.

* rename funcs.

* address comments.

* address comments.

Use NodeDataView/EdgeDataView to keep track of data.

* address comments.

* address comments.

* revert.

* save data with DGL serializer.

* use the right way of getting shape.

* fix lint.

* address comments.

* address comments.

* fix an error in mxnet.

* address comments.

* add edge_map.

* add more test and fix bugs.

Co-authored-by: Zheng <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
7 people authored May 3, 2020
1 parent 5fc334f commit 2190c39
Showing 16 changed files with 1,103 additions and 45 deletions.
2 changes: 0 additions & 2 deletions examples/pytorch/graphsage/train_sampling.py
@@ -78,7 +78,6 @@ def inference(self, g, x, batch_size, device):
Inference with the GraphSAGE model on full neighbors (i.e. without neighbor sampling).
g : the entire graph.
x : the input of entire node set.
The inference code is written in a fashion that it could handle any number of nodes and
layers.
"""
Expand Down Expand Up @@ -114,7 +113,6 @@ def prepare_mp(g):
Explicitly materialize the CSR, CSC and COO representation of the given graph
so that they could be shared via copy-on-write to sampler workers and GPU
trainers.
This is a workaround before full shared memory support on heterogeneous graphs.
"""
g.in_degree(0)
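The `prepare_mp` docstring above describes forcing DGL to build each sparse format once before worker processes fork. A minimal sketch of that pattern, assuming the 0.4-era DGLGraph API (the hunk shows only the first call; the remaining lines illustrate the same idea and may differ from the example's actual body):

```python
def prepare_mp(g):
    """Materialize the CSR, CSC and COO structures of `g` up front so that
    sampler workers and GPU trainers forked later can reuse them via
    copy-on-write instead of rebuilding them in every process."""
    g.in_degree(0)     # an in-degree query forces the in-edge (CSC-like) structure
    g.out_degree(0)    # an out-degree query forces the out-edge (CSR-like) structure
    g.find_edges([0])  # an edge lookup forces the COO structure
```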
11 changes: 9 additions & 2 deletions include/dgl/runtime/shared_mem.h
@@ -58,13 +58,20 @@ class SharedMemory {
* \param size the size of the shared memory.
* \return the address of the shared memory
*/
-void *create_new(size_t size);
+void *CreateNew(size_t size);
/*
* \brief allocate shared memory that has been created.
* \param size the size of the shared memory.
* \return the address of the shared memory
*/
-void *open(size_t size);
+void *Open(size_t size);
+
+/*
+ * \brief check if the shared memory exists.
+ * \param name the name of the shared memory.
+ * \return a boolean value to indicate if the shared memory exists.
+ */
+static bool Exist(const std::string &name);
};
#endif // _WIN32

2 changes: 1 addition & 1 deletion python/dgl/contrib/__init__.py
@@ -1,4 +1,4 @@
from . import sampling
from . import graph_store
from .dis_kvstore import KVClient, KVServer
-from .dis_kvstore import read_ip_config
+from .dis_kvstore import read_ip_config
2 changes: 1 addition & 1 deletion python/dgl/contrib/dis_kvstore.py
@@ -1381,4 +1381,4 @@ def _default_push_handler(self, name, ID, data, target):
self._data_store
"""
target[name][ID] = data
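The `_default_push_handler` shown above simply overwrites rows of the server-side tensor at the pushed IDs. A handler following the same argument pattern could apply a different merge rule; a hypothetical sketch (the name and the accumulate semantics are illustrative, not part of this change):

```python
def sum_push_handler(name, ID, data, target):
    # Accumulate pushed rows into the store instead of overwriting them,
    # e.g. to aggregate partial updates from several trainers.
    target[name][ID] += data
```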


4 changes: 4 additions & 0 deletions python/dgl/distributed/__init__.py
@@ -0,0 +1,4 @@
"""DGL distributed."""

from .dist_graph import DistGraphServer, DistGraph
from .partition import partition_graph, load_partition
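The new `dgl.distributed` package exposes a partition-then-serve workflow: `partition_graph` splits a graph offline (METIS-based assignment, per the commit log), `load_partition` brings one part back on a server, and `DistGraph` gives trainers a graph-like view whose ndata/edata are backed by the distributed KVStore. A rough usage sketch follows; the keyword names, config-file path, and return layout are assumptions for illustration, not a verbatim copy of the API introduced here:

```python
import networkx as nx
import torch
import dgl
from dgl.distributed import partition_graph, load_partition

# Toy input graph with node features (0.4-era DGLGraph construction).
g = dgl.DGLGraph(nx.erdos_renyi_graph(1000, 0.01))
g.ndata['feat'] = torch.randn(g.number_of_nodes(), 16)

# Offline: cut the graph into 4 parts and write them, together with a
# metadata/config file, under 'tmp/toy_graph/'.
partition_graph(g, graph_name='toy_graph', num_parts=4, out_path='tmp/toy_graph')

# On a server: load one partition (its local subgraph plus the node/edge
# data assigned to it). The exact tuple layout is assumed here.
part = load_partition('tmp/toy_graph/toy_graph.json', part_id=0)

# Trainers would then construct a DistGraph (served by DistGraphServer
# processes); its ndata/edata behave like DistTensors spread across
# machines. Constructor arguments are omitted to avoid guessing them.
```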