hysgen project, hypergraph data structure (THGraph), and corresponding changes to the files are added --- PART 2 + updated two gitignore files.
bpedrood committed Aug 11, 2022
1 parent 29127d2 commit a426fa4
Showing 16 changed files with 5,804 additions and 0 deletions.
32 changes: 32 additions & 0 deletions examples/hysgen/.gitignore
@@ -0,0 +1,32 @@
# Prerequisites
*.d

# Compiled Object files
*.slo
*.lo
*.o
*.obj

# Precompiled Headers
*.gch
*.pch

# Compiled Dynamic libraries
*.so
*.dylib
*.dll

# Fortran module files
*.mod
*.smod

# Compiled Static libraries
*.lai
*.la
*.a
*.lib

# Executables
*.exe
*.out
*.app
8 changes: 8 additions & 0 deletions examples/hysgen/.idea/.gitignore

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

29 changes: 29 additions & 0 deletions examples/hysgen/LICENSE
@@ -0,0 +1,29 @@
BSD 3-Clause License

Copyright (c) 2022, Bahman Pedrood
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
11 changes: 11 additions & 0 deletions examples/hysgen/Makefile
@@ -0,0 +1,11 @@
#
# Makefile for this SNAP example
# - modify Makefile.ex when creating a new SNAP example
#
# implements:
# all (default), clean
#

include ../../Makefile.config
include Makefile.ex
include ../Makefile.exmain
9 changes: 9 additions & 0 deletions examples/hysgen/Makefile.ex
@@ -0,0 +1,9 @@
## Main application file
MAIN = hysgen_main
DEPH = $(EXSNAPADV)/hysgen.h
DEPCPP = $(EXSNAPADV)/hysgen.cpp
#CXXFLAGS += $(CXXOPENMP)
#CXXFLAGS += -g -rdynamic
#CXXFLAGS += -ggdb
#CXXFLAGS += -ggdb3 -rdynamic

72 changes: 72 additions & 0 deletions examples/hysgen/README.txt
@@ -0,0 +1,72 @@
========================================================================
Hypergraph Simultaneous Generators (HySGen)
========================================================================

This repository contains three main components:

1) HySGen: An efficient probabilistic generative model for discovering node
clusters/communities in hypergraphs. For the details of the model and
the community inference algorithm, please see our paper*.
2) HGraph: A fast, reliable, and comprehensive C++ data structure for
undirected, unweighted hypergraphs.
3) Three hypergraphs extracted from real-world data, uploaded in the "Data"
directory. Please see our paper* for more information.

Please cite our paper if you use any of these components:
* B. Pedrood, C. Domeniconi, and K. Laskey. "Hypergraph Simultaneous Generators." AISTATS 2022.

/////////////////////////////////////////////////////////////////////////////

The code in this project is developed on top of the SNAP [(c) 2007-2019,
Jure Leskovec] open-source graph analysis library. To make it easier for SNAP
users to work with, I kept the structure and coding standards recommended by
SNAP. The directory structure of this project is as follows:

snap:
An intact copy of the original SNAP library's source code, whose
modules are used in this project.
local_snap:
We developed our classes in this directory. The subdirectories
and file structures are chosen this way for maximum consistency
with SNAP.
local_snap/snap-adv:
HySGen's implemented classes and functions for community inference.
local_snap/snap-core:
HGraph data structure is implemented in this directory. To see
the details of the function and classes, see the files with
"loc_graph" and "loc_subgraph" names.

Like other SNAP projects, this code works under Windows with Cygwin and GCC,
and under Mac OS X, Linux, and other Unix variants with GCC. To use it with
Visual Studio, you have to create a new Visual Studio project for it. Make sure
that a C++ compiler is installed on the system. Makefiles are provided, so you
can compile the code from the command line with the following command:
make all

/////////////////////////////////////////////////////////////////////////////

Parameters:
-i: Input [hyper]edgelist file url.
-o: Output file url + name prefix for the discovered communities.
-c: The number of communities to detect.
-op: Output file performance plot (Default: empty for no plot).
-ci: Community initialization file url (Default: empty).
-l: Url for node names file (Default: empty).
-mc: Minimum size of the communities (Default: 3).
-rs: Random Seed.
-xi: Maximum number of iterations (Default: 1000).
-ic: Initial membership value for the seed communities (Default: 0.1).
-in: The default membership value of each node to all the communities (Default: 0.03).
-rp: Ratio of initial memberships to be randomly perturbed (Default: 0.0).
-rw: Weight for l-1 regularization on learning the model parameters (Default: 0.0).
-sz: Initial step size for backtracking line search (Default: 0.5).
-sa: Control parameter for backtracking line search (Default: 0.5).
-sr: Step-size reduction ratio for backtracking line search (Default: 0.5).

/////////////////////////////////////////////////////////////////////////////

Usage:

Discover 309 communities from the NSF collaboration hypergraph.

./hysgen_main -i:./Data/NSF/hypergraph.hyperedges -o:./out_communities -c:309 -mc:3 -ic:0.1 -in:0.001 -rw:0.0001 -sa:0.95 -sz:0.01
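
The -sz, -sa, and -sr parameters above all feed a backtracking line search. The sketch below shows how such parameters typically interact. It is a generic illustration and not HySGen's code: it assumes the textbook Armijo sufficient-increase rule on a one-dimensional objective. The function name BacktrackingStep and all of its arguments are hypothetical.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper, not part of HySGen or SNAP: backtracking line search
// for MAXIMIZING a 1-D objective f by moving along its gradient.
//   step0     : initial step size                 (the role played by -sz)
//   alpha     : sufficient-increase control param (the role played by -sa),
//               assumed here to be the Armijo constant
//   reduction : step-size reduction ratio         (the role played by -sr)
// Returns the accepted step size, or 0.0 if none is accepted in maxTries.
double BacktrackingStep(const std::function<double(double)>& f,
                        double x, double grad,
                        double step0, double alpha, double reduction,
                        int maxTries = 50) {
  assert(step0 > 0.0 && alpha > 0.0 && alpha < 1.0);
  assert(reduction > 0.0 && reduction < 1.0);
  const double fx = f(x);
  double step = step0;
  for (int i = 0; i < maxTries; ++i) {
    // Armijo condition for ascent: accept the step once the objective has
    // increased by at least alpha * step * grad^2.
    if (f(x + step * grad) >= fx + alpha * step * grad * grad) {
      return step;
    }
    step *= reduction;  // otherwise shrink the step and try again
  }
  return 0.0;
}
```

Under these assumptions, a larger control parameter demands a bigger increase before a step is accepted, and a smaller initial step size makes the first trial more conservative.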
86 changes: 86 additions & 0 deletions examples/hysgen/hysgen_main.cpp
@@ -0,0 +1,86 @@
#include "hysgen.h"
#include "agm.h"

int main(int argc, char* argv[]) {
  Env = TEnv(argc, argv, TNotify::StdNotify);
  Env.PrepArgs(TStr::Fmt("HySGen. build: %s, %s. Time: %s", __TIME__, __DATE__, TExeTm::GetCurTm()));
  TExeTm ExeTm;

  Try

  const TStr InFNm = Env.GetIfArgPrefixStr("-i:", "./synthetic_data/synthetic.hyperedges", "Input [hyper]edgelist file url.");
  const TStr OutFPrx = Env.GetIfArgPrefixStr("-o:", "./synthetic_res", "Output file url + name prefix for the discovered communities.");
  int OptComs = Env.GetIfArgPrefixInt("-c:", 2, "The number of communities to detect.");
  const TStr OutPlt = Env.GetIfArgPrefixStr("-op:", "", "Output file performance plot (empty for no plot).");
  const TStr InitComFNm = Env.GetIfArgPrefixStr("-ci:", "", "Community initialization file url.");
  const TStr LabelFNm = Env.GetIfArgPrefixStr("-l:", "", "Input file name for node names (Node ID, Node label) ");
  const int MinComSize = Env.GetIfArgPrefixInt("-mc:", 3, "Minimum size of the communities.");
  const int RndSeed = Env.GetIfArgPrefixInt("-rs:", 0, "Random Seed");
  int MaxIter = Env.GetIfArgPrefixInt("-xi:", 1000, "Maximum number of iterations");
  const double InitComS = Env.GetIfArgPrefixFlt("-ic:", 0.1, "Initial membership value for the initially assigned communities");
  const double InitNulS = Env.GetIfArgPrefixFlt("-in:", 0.03, "The default membership value of each node to all the communities");
  double PerturbDensity = Env.GetIfArgPrefixFlt("-rp:", 0.0, "Ratio of initial memberships to be randomly perturbed.");
  const double RegCoef = Env.GetIfArgPrefixFlt("-rw:", 0.0, "Weight for l-1 regularization on learning the model parameters");
  const double StepSize = Env.GetIfArgPrefixFlt("-sz:", 0.5, "Initial step size for backtracking line search");
  const double StepCtrlParam = Env.GetIfArgPrefixFlt("-sa:", 0.5, "Control parameter for backtracking line search");
  const double StepReductionRatio = Env.GetIfArgPrefixFlt("-sr:", 0.5, "Step-size reduction ratio for backtracking line search");


  PHGraph G;
  TIntStrH NIDNameH, NIDEdgelistnameH;
  TStrIntH NameNIdH, EdgelistnameNIdH;
  TStrHash<TInt> NodeNameH;
  TVec<TFltV> WckVV;
  TVec<TIntFltH> EstCmtyVH;
  TVec<TIntV> EstCmtyVV;
  if (InFNm.IsSuffix(".hgraph")) {
    TFIn GFIn(InFNm);
    G = THGraph::Load(GFIn);
  } else {
    G = THysgenUtil::LoadEdgeList(InFNm, NodeNameH);
    NIDNameH.Gen(NodeNameH.Len()); NIDEdgelistnameH.Gen(NodeNameH.Len());
    NameNIdH.Gen(NodeNameH.Len()); EdgelistnameNIdH.Gen(NodeNameH.Len());
    for (int s = 0; s < NodeNameH.Len(); s++) {
      NIDNameH.AddDat(s, NodeNameH.GetKey(s));
      NIDEdgelistnameH.AddDat(s, NodeNameH.GetKey(s));
      NameNIdH.AddDat(NodeNameH.GetKey(s), s);
      EdgelistnameNIdH.AddDat(NodeNameH.GetKey(s), s);
    }
  }
  if (LabelFNm.Len() > 0) {
    TSsParser Ss(LabelFNm, ssfTabSep);
    while (Ss.Next()) {
      if (Ss.Len() > 1) {NIDNameH.AddDat(NameNIdH.GetDat(Ss[0]), Ss.GetFld(1)); }
    }
  }
  printf("HyperGraph: %d Nodes %d Edges\n", G->GetNodes(), G->GetEdges());

  TIntV NIDV;
  G->GetNIdV(NIDV);

  TExeTm RunTm;
  THysgen Optimizer(G, 5, RndSeed, InitComS, InitNulS);
  Optimizer.ComInit(OptComs, MinComSize, PerturbDensity);
  if (InitComFNm.Len() > 0) {
    Optimizer.LoadComInit(InitComFNm);
  }
  Optimizer.SetRegCoef(RegCoef);

  double Threshold = TFlt::EpsHalf;
  Optimizer.GetCmtyVV(EstCmtyVH, EstCmtyVV, WckVV, InitNulS, MinComSize);
  THysgenUtil::DumpCmtyVH(OutFPrx + "cmtyvv_init.txt", EstCmtyVH, NIDNameH);

  Optimizer.MLEGradAscent(0.01, MaxIter * G->GetNodes(), OutPlt, StepSize, StepCtrlParam, StepReductionRatio);

  Optimizer.GetCmtyVV(EstCmtyVH, EstCmtyVV, WckVV, Threshold, MinComSize);
  THysgenUtil::DumpCmtyVH(OutFPrx + "cmty_ById_values.tsv", EstCmtyVH, NIDEdgelistnameH);
  THysgenUtil::DumpCmtyVV(OutFPrx + "cmty_ById_members.txt", EstCmtyVV, NIDEdgelistnameH);
  THysgenUtil::DumpCmtyVH(OutFPrx + "cmty_values.txt", EstCmtyVH, NIDNameH);
  THysgenUtil::DumpCmtyVV(OutFPrx + "cmty_members.txt", EstCmtyVV, NIDNameH);

  Catch

  printf("\nrun time: %s (%s)\n", ExeTm.GetTmStr(), TSecTm::GetCurTm().GetTmStr().CStr());

  return 0;
}
8 changes: 8 additions & 0 deletions examples/hysgen/stdafx.cpp
@@ -0,0 +1,8 @@
// stdafx.cpp : source file that includes just the standard includes
// hysgen.pch will be the pre-compiled header
// stdafx.obj will contain the pre-compiled type information

#include "stdafx.h"

// TODO: reference any additional headers you need in STDAFX.H
// and not in this file
5 changes: 5 additions & 0 deletions examples/hysgen/stdafx.h
@@ -0,0 +1,5 @@
#pragma once

#include "targetver.h"

#include "Snap.h"
2 changes: 2 additions & 0 deletions examples/hysgen/synthetic_data/ground_truth_comms.txt
@@ -0,0 +1,2 @@
1 2 3 4 5 11 12 13 14 15 21 22 23 24 25
7 8 9 10 17 18 19 20 27 28 29 30
11 changes: 11 additions & 0 deletions examples/hysgen/synthetic_data/synthetic.description
@@ -0,0 +1,11 @@
Supposed Scenario:

It's the first day of an academic year. Student gatherings have been recorded since some point during the spring semester of the previous year. Assume there are two communities, CS students and history students, and that these are the only social communities to which the students belong. The members of these two communities are listed in the file "ground_truth_comms.txt".
The hypergraph for this example is the network of gatherings recorded at the university over the period described above. Each gathering corresponds to a hyperedge that connects the attending students. The hypergraph is saved in "synthetic.hyperedges", where each line lists the IDs of the nodes in one hyperedge.
An ordinary graph equivalent of the hypergraph is given in "synthetic.edges", which stores the list of edges. This graph is created by mapping each hyperedge of size k to a k-clique.


#################################
Community detection complication:

Two large hyperedges in the hypergraph, corresponding to two outdoor welcome parties for the students, complicate the problem of discovering the communities. Nodes 36 through 71 in these hyperedges represent passersby who are not students and only joined the parties to enjoy the music, games, and free food. Nodes 76 to 87 are new students, half (6) CS and half (6) history. The new students are not expected to be correctly identified, because the only gatherings they have attended so far are an orientation, held with 3 senior students of each major who talked to them about the department, and the welcome party. At the party, they are divided into party groups (4-8) that are independent of their major.
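
The relationship described above between "synthetic.hyperedges" and "synthetic.edges" can be sketched as a clique expansion. This is a minimal illustration rather than the code used to build the data files. It assumes each hyperedge line holds whitespace-separated integer node IDs, which matches the layout of "ground_truth_comms.txt" but is not confirmed for the hyperedge file. ParseHyperedge and CliqueExpansion are hypothetical names that do not exist in HySGen or SNAP.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helpers for illustration only (not part of HySGen or SNAP).

// Parse one line of whitespace-separated integer node IDs into a hyperedge.
// Assumption: this is the line format of the hyperedge list file.
std::vector<int> ParseHyperedge(const std::string& line) {
  std::vector<int> nodes;
  std::istringstream in(line);
  for (int id; in >> id;) nodes.push_back(id);
  return nodes;
}

// Clique expansion: every pair of distinct nodes that share a hyperedge
// becomes one undirected edge, stored as (smaller ID, larger ID). Pairs that
// co-occur in several hyperedges are stored only once.
std::set<std::pair<int, int>> CliqueExpansion(
    const std::vector<std::vector<int>>& hyperedges) {
  std::set<std::pair<int, int>> edges;
  for (const auto& he : hyperedges) {
    for (std::size_t i = 0; i < he.size(); ++i) {
      for (std::size_t j = i + 1; j < he.size(); ++j) {
        if (he[i] == he[j]) continue;  // ignore a node ID repeated in a line
        edges.emplace(std::min(he[i], he[j]), std::max(he[i], he[j]));
      }
    }
  }
  return edges;
}
```

A hyperedge of size k contributes k(k-1)/2 edges, so the two large welcome-party hyperedges dominate the expanded graph, and the expansion no longer records which gathering produced each edge.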