Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the DisjointSet data structure #521

Open
gissuebot opened this issue Oct 31, 2014 · 57 comments
Open

Add the DisjointSet data structure #521

gissuebot opened this issue Oct 31, 2014 · 57 comments

Comments

@gissuebot
Copy link

Original issue created by ogregoire on 2011-01-12 at 11:58 PM


It's a useful data structure that is not so easy to implement by itself, and that is very useful in algorithmics, such as in the graph theory, for instance.

@gissuebot
Copy link
Author

Original comment posted by [email protected] on 2011-01-13 at 01:25 AM


We have a few uses for such a thing ourselves. I started work on it once upon a time...

The name has always bugged me, though; "PartitionedSet" would seem to make more sense, but perhaps it's not worth going against the grain.


Status: Accepted
Labels: Type-Enhancement

@gissuebot
Copy link
Author

Original comment posted by [email protected] on 2011-01-13 at 01:26 AM


for the general interest: http://en.wikipedia.org/wiki/Disjoint-set_data_structure

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-02-02 at 06:05 PM


Any thoughts on what the API would look like?

@gissuebot
Copy link
Author

Original comment posted by [email protected] on 2011-02-03 at 03:07 AM


Disturbingly, I can't find my work on it anywhere.

Rough thoughts were: DisjointSet<E> interface containing the read-only methods. One to view as a Set<Set<E>>, one to view the union Set<E>, one to get the Set<E> for a given E, one to view the entire thing as an Equivalence<E>.

Mutable implementation, with or without specialized mutable interface. Method to declare e1 and e2 to be equivalent. This is where the traditional union-find algorithm ends up. Method to remove an E?

Immutable implementation. Presumably backed by an ImmutableMap<E, ImmutableSet<E>>.

Various other methods I forgot.

Louis, if you are up for taking a crack at this, this is something I feel we could push through.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-02-03 at 03:45 AM


Absolutely up for it -- after my paper due Friday. ;)

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-02-03 at 06:20 AM


The key thing in this structure is that the elements point to the set, instead of the opposite way which is what we typically mean when we say "a Set". So, please, don't subtype Set, don't try to shoehorn an iterator or a remove operation. The result would be negating the strengths of the structure; the quintessential union/find is best.

After some digging, here's an implementation of my own, some 4-5 years old.
http://code.google.com/p/flexigraph/source/browse/trunk/src/gr/forth/ics/util/Partition.java

(It seems I took the time to implement the best find() version, instead of the easiest, which is nice; single-pass, instead of two-pass or recursive).

Let me make a concrete recommendation now though. I'll call the structure "Partition", below:

 * Don't make Partition point to a user value. This is completely unnecessary; the user will associate each value to its Partition even if Partition offers the other direction. (I made this mistake)
 * Following from the above: don't introduce a type parameter. This artificially makes it difficult to merge Partition<S> and Partition<T>. (I made this mistake too).
 * No "iterator" or anything that requires the set to point to the elements, which defeats the purpose of the structure.
 * Expose the equivalence relation only. A singleton Equivalence<Partition> is a fine way to do it. Having Equivalence<Partition<?>> would look ugly, for a reason (because the type parameter isn't needed!). Don't expose the representative of an element (btw, remember to do the self-loop trick to denote the root). Other than the equivalence, just singleton() and merge() need to be offered.
 * Finally, consider offering this method:
public static <E> Equivalence<E> partition(Function<E, Partition> partitioner);
Which maps the equivalence relation of Partition to an equivalence relation of referrers of Partition.

I don't care too much about the "Partition" name, but I do care about the rest. Make a guy with a weak spot for graph algorithms happy - otherwise I'll go and whine to Robert Tarjan himself. Seriously. (Ok, perhaps not that seriously, but seriously, I'll be sad, this is a gem of a structure, lets preserve that).

@gissuebot
Copy link
Author

Original comment posted by [email protected] on 2011-02-03 at 02:02 PM


Ok, let's make sure we sort out what to do before you dig into this, Louis.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-02-05 at 12:32 AM


I swear, I have a weaker spot for graph algorithms than you, and my biggest soft spot of all is for data structure gems!

I like some ideas from your approach, Jim, but I think it can be made considerably simpler. Specifically, I think the attached is pretty much your approach taken to its logical conclusion -- to remove the elements from the picture entirely, and treat the partitions as objects in their own right. The effect of p1.combine(p2) is to guarantee that p1.equals(p2) is true for all time, while maintaining that equals() is an equivalence relation. Elements don't come into it at all.

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-02-05 at 01:01 AM


I like the simplicity of equals(), heck, that's supposed to model an equivalence class, and Equivalences already play nice with it, so it's fine.

About find(), I see you implemented the recursive one for simplicity (or laziness :)) - but we ought to provide the fastest.

Now, about the rank... I'm on the fence. I think without the rank optimization, one gets polylogarithmic complexity, instead of the inverse Ackerman (together with path compression). The rank itself "ought" to increase the memory overhead (I'm certain that some people more knowledgeable than me asserted that), but I'm not so sure, it must be VM dependant. On the VMs I'm usually on, there would be no observable difference to me, due to aligning effects (the existence of the Partition pointer would bump the cost of the object from 8 bytes to 16 bytes, so in this case, the second field is free).

But without that field, it would be possible to pull off a concurrent, non-blocking implementation (just via compare-and-swap on the pointer), which would be nice, because this structure does play well with partitioning things to parallelize, then merge the sets to compute some global result.

So, how about having an interface for this, making an sequential implementation w/ the rank, and leave the door open for a non-blocking one without the rank.

Btw, I was quick to play down the "Partition" name, but I think it's quite good (the other good names would be "Class" and "Set", obviously poor choices in this context :)).

(PS: I'm not going to argue with you about who is more addicted to graph algorithms :P )

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-02-05 at 01:08 AM


Well, yes, I was going for a quick-and-dirty implementation, so I was somewhat lazy.

I'm inclined to do the sequential implementation for the moment, with the lovely inverse-Ackermann complexity.

@gissuebot
Copy link
Author

Original comment posted by [email protected] on 2011-02-05 at 02:07 AM


I like the sound of the (very different from what I had been thinking!) direction you guys are taking this.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-02-05 at 05:06 AM


Oh, yeah. I went through several drafts of my first response that were along the lines of "You're on crack! What a ridiculous data structure!" Then it was like, oh man, that's actually not half bad. Then it was like, but this could be so much simpler! Bam! Bam! .equals()!

So yeah.

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-02-05 at 05:19 AM


Oh boy. How old is this guy? :)

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-02-05 at 05:24 AM


More seriously, let me back off that interface idea. Rethinking it, that would create ugly problems, like sequential.{merge,equals}(nonSequential); // oops
So lets just have a single concrete class; if we need the other, we create a second, unrelated class.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-02-05 at 05:50 AM


I think that's definitely the way to go, and I think a sequential version is the right default.

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-02-14 at 06:31 PM


Just so I don't have this in my mind: Louis? Is this something we should expect you give a full contribution of? (I mean, the final product, optimized, documented -especially hashCode() and equals()-, some tests). Is it something you're already working on?

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-02-14 at 06:39 PM


Yeah, I'm doing the whole shebang.

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-02-14 at 06:43 PM


Ok, great, thanks!

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-02-14 at 09:44 PM


Observation: I think this should go in base, not collect, since it's really not a collection in any way.

@gissuebot
Copy link
Author

Original comment posted by ray.a.conner on 2011-05-04 at 12:10 AM


I'm late to this discussion, so please forgive me. If I'm grokking the gist of the approach (referring to Wasserman's posted impl):

  • the user is responsible for associating user-provided objects with Partition instances
  • Partitions never disappear, but are merged as "flatly" as possible

I really like the simplicity, but I have some questions about how one would use it.

Wouldn't algorithms which start with a state of one object per partition be somewhat inefficient? By that, I mean that the user of the library will be keeping around a HashMap< E, Partition >, and there will permanently be just as many Partitions as there are elements. Although I suppose if the user didn't do it, then the Partition impl would have to. So maybe this is just shifting the burden, with the possibility that the user won't care and it wouldn't be needed at all.

Having run a connected components algorithm, I can easily determine whether any two vertices are in the same connected component (via the Equivalence). But how can I determine how many components there are? Or, how can I list the vertices in each component? Do you need to expose the find() method to allow answering questions like this, or is the intent that the user should be keeping track of this kind of information? Decrementing the number of components every time a combine() is successful is easy enough, and Sets.union() or similar handles the latter question (with the user keeping track of all the Sets), but this does seem like exactly the kind of thing someone might want to use Partition for.

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-05-04 at 12:28 AM


"- the user is responsible for associating user-provided objects with Partition instances"
Yes. Note that a more efficient alternative to Map<E, Partition>, is a Partition field in E itself.

"- Partitions never disappear, but are merged as "flatly" as possible"
They disappear at the discretion of the garbage collector :)

One object per partition is the minimum you can get.

"But how can I determine how many components there are?"
A good start would be Multimaps#index(Iterable<E>, Function<E, Partition>), which gives a Multimap<Partition, E>. But to turn it into a more useful Multimap<E, E>, one would require the inverse relation, Map<Partition, E>, which leads to rather verbose code (trust me, I tried it).

My original (internal) proposal included a "classify" method which for the same arguments returned Multimap<E, E> instead, directly. It's still up in the air (seconds ago I re-suggested that, with the return type being Map<E, Set<E>>, lets see what happens to that). That would answer all your following questions; exposing find() is still unnecessary.

@gissuebot
Copy link
Author

Original comment posted by ray.a.conner on 2011-05-04 at 06:31 PM


Maybe a contrived example would help, and you can tell me where my thinking has gone wrong. I start out with a big graph...

  Map< Node, Partition > map = Maps.newHashMap();
  for( node : nodes ) {
    map.put( node, Partition.newPartition() );
  }

Now, inelegant brute-force connected components:

  int size = nodes.size();
  for( edge : edges ) {
    Partition pTail = map.get( edge.tail );
    Partition pHead = map.get( edge.head );
    if( pTail.combine( pHead ) ) {
      size--;
    }
    if( size == 1 ) {
      break;
    }
  }

The calculation is done, and the Partition Equivalence would work. However, there are still 1M Partition instances, one per node. From the information I have, I cannot construct a Function<E, Partition> that is "normalized". So Multimaps.index() wouldn't be useful.

This edit reduces the number of Partitions in use by one allowing pHead to be gc'd, but only if pHead happens to only contain the one node.

if( pTail.combine( pHead ) ) {
  map.put( edge.head, pTail );
  size--;
}

As near as I can tell, the only way to keep track of this information is to maintain an actual Map<Partition, Set<E>> during the algorithm, in addition to the existing Map. A single graph data structure would be simpler, and then I would effectively be performing the exact same function as Partition, and I don't need the class at all.

What am I missing?

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-05-04 at 06:59 PM


#index you would get a Multimap<Partition, E>. Just iterate over the
<Partition, E> entries, and nominate the key Partition as the "canonical"
one for the value E. Seems good enough.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-05-04 at 08:33 PM


Not true! After a combine() is performed, the only way to tell the
Partition instances apart is object identity, which Multimaps.index()
doesn't use. Try it yourself: once Partitions have been combined, they're
equal as far as Object.equals() and Object.hashCode() are concerned, which
is all Multimaps.index() cares about.

Louis

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-05-04 at 08:53 PM


I don't understand who you are responding to and what is "not true", could you clarify?

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-05-04 at 09:45 PM


Sorry, I tried to respond in an email instead of in the comment interface.

The calculation is done, and the Partition Equivalence would work. However, there
are still 1M Partition instances, one per node. From the information I have, I
cannot construct a Function<E, Partition> that is "normalized". So Multimaps.index()
wouldn't be useful.

Not true! After a combine() is performed, the only way to tell the
Partition instances apart is object identity, which Multimaps.index()
doesn't use. Try it yourself: once Partitions have been combined, they're
equal as far as Object.equals() and Object.hashCode() are concerned, which
is all Multimaps.index() cares about.

Louis

@gissuebot
Copy link
Author

Original comment posted by ray.a.conner on 2011-05-06 at 09:57 PM


[slaps forehead, utters "Doh!", reaches for beer...]

@gissuebot
Copy link
Author

Original comment posted by finnw1 on 2011-05-21 at 12:03 PM


@jim.andreou, does it matter if there is a race between two threads comparing/updating a rank? If I understand the code correctly this would result in a suboptimal tree, but still a valid one.

@gissuebot
Copy link
Author

Original comment posted by jim.andreou on 2011-05-21 at 06:20 PM


Yes, it matters, eg two threads might try to link a partition to different partitions, and one would overwrite the other. I think the only way this could be used in a multithreaded environment would be grabbing a common lock. (Even locking the partitions themselves, say in Ordering.arbitrary() order, wouldn't help).

@gissuebot
Copy link
Author

Original comment posted by finnw1 on 2011-05-22 at 12:14 PM


As far as I can see the only way to get a corrupted tree (as opposed to a suboptimal one) is that two nodes should have a common ancestor but do not (e.g. because thread1 sets a.parent = b and thread2 (not seeing thread1's change) sets a.parent = c.

There is no danger of spurious merges, because if two partitions are not meant to be merged then the merge method never sees pointers to both at once.

All you need to ensure is that before a.merge(b) returns, a and b have a common ancestor (and that all threads agree on the fact.) Anything else is just optimization. I cannot see any way in which this constraint could be violated that would not be solved by a compareAndSet() loop.

E.g. this should work:

public ConcurrentPartition merge(ConcurrentPartition other) {
    ConcurrentPartition root1, root2;
    while ((root1 = other.find()) != (root2 = find())) {
        if (root1.rank < root2.rank) {
            if (root1.parentRef.compareAndSet(root1, root2)) {
                return root2;
            }
        } else if (root1.rank > root2.rank) {
            if (root2.parentRef.compareAndSet(root2, root1)) {
                return root1;
            }
        } else {
            if (root2.parentRef.compareAndSet(root2, root1)) {
                root1.rank++;
                return root1;
            }
        }
    }
    return root2;
}

The worst that could happen is that each thread only performs only one call to merge() and no thread sees any other thread's updates to rank (easily simulated by making rank ThreadLocal). Then you end up with a singly-linked list, but that would still give you the right answers for equals() and hashCode() and the path compression would still get a chance to reduce the depth.

You might need global synchronization at a higher level, to ensure that all merge()s have completed before you start performing comparisons with equals() and hashCode(), but that does not affect the implementation of merge().

Feel free to prove me wrong.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2011-09-20 at 06:50 PM


We never got around to this. Should it be revived?

@gissuebot
Copy link
Author

Original comment posted by [email protected] on 2011-09-20 at 06:57 PM


Oh. I had a dormant change list, forgot about it. The main problem in user code is not related to Partition - it's that it's just Too Hard (TM) to make a default value map in java (even with guava, unless that default value happens to be a collection).

To take some verbosity out, basically, I meant to introduce a Partition#newPartitionMap(), returning a Map<E, Partition>, where the user wouldn't do the little "is the key in the map? if not, add this entry" dance.

Kevin had proposed a Function<E, Partition> instead of default-valued map, but this wouldn't do since a common operation turned out to be accessing the keySet() of said map.

It was a complicated discussion, taking more time than it was worth it, so I put it in the backburner.

@gissuebot
Copy link
Author

Original comment posted by [email protected] on 2011-09-21 at 01:35 AM


I actually have some thoughts on this I will try to pull together soon.

@gissuebot
Copy link
Author

Original comment posted by [email protected] on 2011-12-10 at 04:01 PM


(No comment entered for this change.)


Labels: Package-Collect

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2012-01-17 at 03:54 AM


Proposal:

Introduce a Partitioning<E> class that manages the "default-valued map" details internally. If we want to support key-set iteration, we can do that, but manage it internally.

Partitioning<E> supports Partition getPartition(E e), which returns an object of the Partition type we'd been playing with earlier. Additionally supports inSamePart(E, e), and all of that.

For the moment, Partitioning locks the whole thing when doing any mutations. If we design a more effective concurrent version of the structure later, then that's great, but we guarantee thread-safety now (with a qualifier about performance) and try to optimize later.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2012-01-17 at 04:34 AM


For reference: Wikipedia links to http://citeseer.ist.psu.edu/anderson94waitfree.html, "Wait-free Parallel Algorithms for the Union Find Problem." It sounds, well, pretty much exactly like what we wanted?

@gissuebot
Copy link
Author

Original comment posted by [email protected] on 2012-05-30 at 07:43 PM


(No comment entered for this change.)


Labels: -Type-Enhancement, Type-Addition

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2012-06-14 at 10:21 PM


Here is a straw API to play with:

lowasser/guava-experimental@e84cb02#diff-1

To summarize: UnionFindEquivalence is an Equivalence<Object> implementation that starts out equivalent to Equivalence.equals(), but allows you to merge equivalence classes. Internally, it holds a Map<Object, Partition> for objects that are in equivalence classes of size > 1.

No, it doesn't let you get the objects back out for the moment, but I've written a couple of algorithms using disjoint-set data structures and it's satisfied my needs there. In a pinch, it could get retrofitted later for some of those operations, but this feels relatively neat and tidy, and addresses the use cases I'm aware of.

@gissuebot
Copy link
Author

Original comment posted by thomas.andreas.jung on 2012-06-15 at 07:07 AM


Personally, I would prefer an immutable Equivalence created with a builder:
UnionFind equiv = UnionFindBuilder.builder().merge(a, b).merge(c, d).build();

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2012-06-15 at 02:54 PM


@thomas: For all of the applications I came across, such an equivalence would be of little use. Union-find data structures need to be dynamic to serve their traditional applications.

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2012-06-15 at 02:55 PM


Specific examples: Kruskal's algorithm is utterly pointless without a dynamic union-find structure, and several variations on it are equally pointless.

@gissuebot
Copy link
Author

Original comment posted by thomas.andreas.jung on 2012-06-16 at 07:30 PM


Wouldn't
equiv = UnionFindEquivalence.create()
equiv.merge(a, b);
w = equiv.wrapp(a);
w.equals(equiv.wrapp(c)) -> false
equiv.merge(b, c);
w.equals(equiv.wrapp(c)) -> true
break the equals contract?

@gissuebot
Copy link
Author

Original comment posted by wasserman.louis on 2012-06-16 at 09:50 PM


Well, the Object.equals() contract specifies

"It is consistent: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified."

Of course, the last part is crucial: "provided no information used in equals comparisons on the objects is modified."

The doc for Equivalence isn't quite so specific, although it has allusions along the same lines, e.g. for hash: "...as long as x remains unchanged according to the definition of the equivalence."

Essentially, I'm suggesting that that exception apply here. It remains the case that a non-dynamic version is useless for the traditional applications of this data structure...and that it makes a lot of sense to model this as an equivalence relation rather than a Map<E, Partition> or something along those lines.

@gissuebot
Copy link
Author

Original comment posted by thomas.andreas.jung on 2012-06-18 at 06:46 AM


Okay, I remembered a more strict version of the contract, i.e. I'd always implement equals in this way. One effect with a weaker implementation is that objects disappear from collections:

UnionFindEquivalence equiv = UnionFindEquivalence.create();
Wrapper<Integer> wrap = equiv.wrap(1);
HashSet<Wrapper<Integer>> set = new HashSet<Wrapper<Integer>>();
set.add(wrap);
System.out.println(set.contains(wrap)); //true
equiv.merge(1, 2);
System.out.println(set.contains(wrap)); //false

You're right that an immutable version of a UnionFindEquivalence is useless (unless you find a practical persistent Union-find data structure).

@kevinb9n
Copy link
Contributor

Now that Partition is getting a fair amount of usage we will look into improvements here.

@Tarrasch
Copy link

Any updates on this? @kevinb9n can you link to what a "Partition" is?

@jrtom
Copy link
Member

jrtom commented May 26, 2016

We've been having some internal discussions on this topic recently, and with common.graph having been making a lot of progress recently, there's been more interest in sorting out graph-related things like this.

My two cents:

  • The right name for this thing is "Partition": a partition of S is the correct mathematical term for a spanning set of disjoint subsets of S.
    • "Union-Find" is the name of an algorithm that one can use to generate a partition;
    • "disjoint set" doesn't work as a name for this thing, because "set" (singular) makes it unclear what the disjointedness refers to, and it doesn't capture the fact that the parts must span the original set.
  • This object should provide the following capabilities:
    • return the Set in the Partition that a specified element [in S] belongs to
    • return a Collection of these (disjoint) Sets

Reference: https://en.wikipedia.org/wiki/Partition_of_a_set

I don't know that this is the API that we're going to end up with, but that's my current opinion for what it should look like.

@mozinrat
Copy link

Hi team, is there finally an implementation in guava yet? If yes what its called.

@jrtom jrtom self-assigned this May 21, 2017
@jrtom
Copy link
Member

jrtom commented May 21, 2017

@mozinrat Not yet, I'm afraid. I still have this issue on my plate, but I haven't had the cycles to push it yet. I'd like to get it resolved myself, as it would make it easier for us to release a couple of internal graph algorithm implementations.

@cgdecker cgdecker added the P3 no SLO label Jul 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants