Skip to content

vigna/webgraph

Repository files navigation

Welcome to WebGraph!

Introduction

WebGraph is a framework for graph compression aimed at studying web graphs. It provides simple ways to manage very large graphs, exploiting modern compression techniques. More precisely, it is currently made of:

  1. A set of flat codes, called ζ codes, which are particularly suitable for storing web graphs (or, in general, integers with power-law distribution in a certain exponent range). The fact that these codes work well can be easily tested empirically, but we also try to provide a detailed mathematical analysis.

  2. Algorithms for compressing web graphs that exploit gap compression and referentiation (à la LINK), intervalisation and ζ codes to provide a high compression ratio (see our datasets). The algorithms are controlled by several parameters, which provide different tradeoffs between access speed and compression ratio.

  3. Algorithms for accessing a compressed graph without actually decompressing it, using lazy techniques that delay the decompression until it is actually necessary.

  4. Algorithms for analysing very large graphs, such as HyperBall which has been used to show that Facebook has just four degrees of separation.

  5. An implementation of the algorithms above in Java and Rust distributed under either the GNU Lesser General Public License 2.1+ or the Apache Software License 2.0. Besides a clearly defined API, we also provide several classes tha modify (e.g., transpose) or recompress a graph, so to experiment with various settings.

  6. Datasets for large graph. These are either gathered from public sources (such as WebBase), or produced by UbiCrawler and BUbiNG.

In the end, with WebGraph you can access and analyse very large web graphs. Using WebGraph is as easy as installing a few jar files and downloading a dataset. This makes studying phenomena such as PageRank, distribution of graph properties of the web graph, etc. very easy.

You are welcome to use and improve WebGraph! If you find our software useful for your research, please quote this paper.

This version of WebGraph is limited to graphs with at most 2³¹ nodes. For larger graphs, have a look at the big version.

Hadoop

Helge Holzmann has developed an input format for Hadoop for graphs in BVGraph format.

WebGraph++

Jacob Ratkievicz has developed a C++ version of WebGraph that you might want to try. The library exposes a BVGraph as an object of the Boost Graph Library, so it is easily integrable with other code.

pyWebgraph

Massimo Santini has developed a front-end that interfaces Jython with WebGraph. It makes exploring small portions of very large graphs very easy and interactive.

Papers

  • A detailed description of the compression algorithms used in WebGraph, published in the proceedings of the Thirteenth International World–Wide Web Conference.

  • A mathematical analysis of the performance of γ, δ and ζ codes against power-law distributions.

  • Some quite surprising experiments showing that the transpose graph reacts very peculiarly to compression after lexicographical or Gray-code sorting.

  • A paper about HyperBall (then named HyperANF), our tool for computing an approximation of the neighbourhood function, reachable nodes and geometric centralities of massive graphs. More information can be found in this preprint.

  • HyperBall was used to find out that on average there are just four degrees of separation on Facebook, and the experiment was reported by the New York Times. Alas, the degrees were actually 3.74 (one less than the average distance), but the off-by-one between graph theory (“distance”) and sociology (“degrees of separation”) generated a lot of confusion.

  • A paper were we describe our efforts to compress one of the largest social graphs available—the graph of commits of the Software Heritage archive.

  • A paper about our effort to bring WebGraph to Rust.

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages