-
Notifications
You must be signed in to change notification settings - Fork 10.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the DisjointSet data structure #521
Comments
Original comment posted by [email protected] on 2011-01-13 at 01:25 AM We have a few uses for such a thing ourselves. I started work on it once upon a time... The name has always bugged me, though; "PartitionedSet" would seem to make more sense, but perhaps it's not worth going against the grain. Status: |
Original comment posted by [email protected] on 2011-01-13 at 01:26 AM for the general interest: http://en.wikipedia.org/wiki/Disjoint-set_data_structure |
Original comment posted by wasserman.louis on 2011-02-02 at 06:05 PM Any thoughts on what the API would look like? |
Original comment posted by [email protected] on 2011-02-03 at 03:07 AM Disturbingly, I can't find my work on it anywhere. Rough thoughts were: DisjointSet<E> interface containing the read-only methods. One to view as a Set<Set<E>>, one to view the union Set<E>, one to get the Set<E> for a given E, one to view the entire thing as an Equivalence<E>. Mutable implementation, with or without specialized mutable interface. Method to declare e1 and e2 to be equivalent. This is where the traditional union-find algorithm ends up. Method to remove an E? Immutable implementation. Presumably backed by an ImmutableMap<E, ImmutableSet<E>>. Various other methods I forgot. Louis, if you are up for taking a crack at this, this is something I feel we could push through. |
Original comment posted by wasserman.louis on 2011-02-03 at 03:45 AM Absolutely up for it -- after my paper due Friday. ;) |
Original comment posted by jim.andreou on 2011-02-03 at 06:20 AM The key thing in this structure is that the elements point to the set, instead of the opposite way which is what we typically mean when we say "a Set". So, please, don't subtype Set, don't try to shoehorn an iterator or a remove operation. The result would be negating the strengths of the structure; the quintessential union/find is best. After some digging, here's an implementation of my own, some 4-5 years old. (It seems I took the time to implement the best find() version, instead of the easiest, which is nice; single-pass, instead of two-pass or recursive). Let me make a concrete recommendation now though. I'll call the structure "Partition", below: * Don't make Partition point to a user value. This is completely unnecessary; the user will associate each value to its Partition even if Partition offers the other direction. (I made this mistake) I don't care too much about the "Partition" name, but I do care about the rest. Make a guy with a weak spot for graph algorithms happy - otherwise I'll go and whine to Robert Tarjan himself. Seriously. (Ok, perhaps not that seriously, but seriously, I'll be sad, this is a gem of a structure, lets preserve that). |
Original comment posted by [email protected] on 2011-02-03 at 02:02 PM Ok, let's make sure we sort out what to do before you dig into this, Louis. |
Original comment posted by wasserman.louis on 2011-02-05 at 12:32 AM I swear, I have a weaker spot for graph algorithms than you, and my biggest soft spot of all is for data structure gems! I like some ideas from your approach, Jim, but I think it can be made considerably simpler. Specifically, I think the attached is pretty much your approach taken to its logical conclusion -- to remove the elements from the picture entirely, and treat the partitions as objects in their own right. The effect of p1.combine(p2) is to guarantee that p1.equals(p2) is true for all time, while maintaining that equals() is an equivalence relation. Elements don't come into it at all. |
Original comment posted by jim.andreou on 2011-02-05 at 01:01 AM I like the simplicity of equals(), heck, that's supposed to model an equivalence class, and Equivalences already play nice with it, so it's fine. About find(), I see you implemented the recursive one for simplicity (or laziness :)) - but we ought to provide the fastest. Now, about the rank... I'm on the fence. I think without the rank optimization, one gets polylogarithmic complexity, instead of the inverse Ackerman (together with path compression). The rank itself "ought" to increase the memory overhead (I'm certain that some people more knowledgeable than me asserted that), but I'm not so sure, it must be VM dependant. On the VMs I'm usually on, there would be no observable difference to me, due to aligning effects (the existence of the Partition pointer would bump the cost of the object from 8 bytes to 16 bytes, so in this case, the second field is free). But without that field, it would be possible to pull off a concurrent, non-blocking implementation (just via compare-and-swap on the pointer), which would be nice, because this structure does play well with partitioning things to parallelize, then merge the sets to compute some global result. So, how about having an interface for this, making an sequential implementation w/ the rank, and leave the door open for a non-blocking one without the rank. Btw, I was quick to play down the "Partition" name, but I think it's quite good (the other good names would be "Class" and "Set", obviously poor choices in this context :)). (PS: I'm not going to argue with you about who is more addicted to graph algorithms :P ) |
Original comment posted by wasserman.louis on 2011-02-05 at 01:08 AM Well, yes, I was going for a quick-and-dirty implementation, so I was somewhat lazy. I'm inclined to do the sequential implementation for the moment, with the lovely inverse-Ackermann complexity. |
Original comment posted by [email protected] on 2011-02-05 at 02:07 AM I like the sound of the (very different from what I had been thinking!) direction you guys are taking this. |
Original comment posted by wasserman.louis on 2011-02-05 at 05:06 AM Oh, yeah. I went through several drafts of my first response that were along the lines of "You're on crack! What a ridiculous data structure!" Then it was like, oh man, that's actually not half bad. Then it was like, but this could be so much simpler! Bam! Bam! .equals()! So yeah. |
Original comment posted by jim.andreou on 2011-02-05 at 05:19 AM Oh boy. How old is this guy? :) |
Original comment posted by jim.andreou on 2011-02-05 at 05:24 AM More seriously, let me back off that interface idea. Rethinking it, that would create ugly problems, like sequential.{merge,equals}(nonSequential); // oops |
Original comment posted by wasserman.louis on 2011-02-05 at 05:50 AM I think that's definitely the way to go, and I think a sequential version is the right default. |
Original comment posted by jim.andreou on 2011-02-14 at 06:31 PM Just so I don't have this in my mind: Louis? Is this something we should expect you give a full contribution of? (I mean, the final product, optimized, documented -especially hashCode() and equals()-, some tests). Is it something you're already working on? |
Original comment posted by wasserman.louis on 2011-02-14 at 06:39 PM Yeah, I'm doing the whole shebang. |
Original comment posted by jim.andreou on 2011-02-14 at 06:43 PM Ok, great, thanks! |
Original comment posted by wasserman.louis on 2011-02-14 at 09:44 PM Observation: I think this should go in base, not collect, since it's really not a collection in any way. |
Original comment posted by ray.a.conner on 2011-05-04 at 12:10 AM I'm late to this discussion, so please forgive me. If I'm grokking the gist of the approach (referring to Wasserman's posted impl):
I really like the simplicity, but I have some questions about how one would use it. Wouldn't algorithms which start with a state of one object per partition be somewhat inefficient? By that, I mean that the user of the library will be keeping around a HashMap< E, Partition >, and there will permanently be just as many Partitions as there are elements. Although I suppose if the user didn't do it, then the Partition impl would have to. So maybe this is just shifting the burden, with the possibility that the user won't care and it wouldn't be needed at all. Having run a connected components algorithm, I can easily determine whether any two vertices are in the same connected component (via the Equivalence). But how can I determine how many components there are? Or, how can I list the vertices in each component? Do you need to expose the find() method to allow answering questions like this, or is the intent that the user should be keeping track of this kind of information? Decrementing the number of components every time a combine() is successful is easy enough, and Sets.union() or similar handles the latter question (with the user keeping track of all the Sets), but this does seem like exactly the kind of thing someone might want to use Partition for. |
Original comment posted by jim.andreou on 2011-05-04 at 12:28 AM "- the user is responsible for associating user-provided objects with Partition instances" "- Partitions never disappear, but are merged as "flatly" as possible" One object per partition is the minimum you can get. "But how can I determine how many components there are?" My original (internal) proposal included a "classify" method which for the same arguments returned Multimap<E, E> instead, directly. It's still up in the air (seconds ago I re-suggested that, with the return type being Map<E, Set<E>>, lets see what happens to that). That would answer all your following questions; exposing find() is still unnecessary. |
Original comment posted by ray.a.conner on 2011-05-04 at 06:31 PM Maybe a contrived example would help, and you can tell me where my thinking has gone wrong. I start out with a big graph... Map< Node, Partition > map = Maps.newHashMap(); Now, inelegant brute-force connected components: int size = nodes.size(); The calculation is done, and the Partition Equivalence would work. However, there are still 1M Partition instances, one per node. From the information I have, I cannot construct a Function<E, Partition> that is "normalized". So Multimaps.index() wouldn't be useful. This edit reduces the number of Partitions in use by one allowing pHead to be gc'd, but only if pHead happens to only contain the one node.
As near as I can tell, the only way to keep track of this information is to maintain an actual Map<Partition, Set<E>> during the algorithm, in addition to the existing Map. A single graph data structure would be simpler, and then I would effectively be performing the exact same function as Partition, and I don't need the class at all. What am I missing? |
Original comment posted by jim.andreou on 2011-05-04 at 06:59 PM #index you would get a Multimap<Partition, E>. Just iterate over the |
Original comment posted by wasserman.louis on 2011-05-04 at 08:33 PM Not true! After a combine() is performed, the only way to tell the Louis |
Original comment posted by jim.andreou on 2011-05-04 at 08:53 PM I don't understand who you are responding to and what is "not true", could you clarify? |
Original comment posted by wasserman.louis on 2011-05-04 at 09:45 PM Sorry, I tried to respond in an email instead of in the comment interface.
Not true! After a combine() is performed, the only way to tell the Louis |
Original comment posted by ray.a.conner on 2011-05-06 at 09:57 PM [slaps forehead, utters "Doh!", reaches for beer...] |
Original comment posted by finnw1 on 2011-05-21 at 12:03 PM @jim.andreou, does it matter if there is a race between two threads comparing/updating a rank? If I understand the code correctly this would result in a suboptimal tree, but still a valid one. |
Original comment posted by jim.andreou on 2011-05-21 at 06:20 PM Yes, it matters, eg two threads might try to link a partition to different partitions, and one would overwrite the other. I think the only way this could be used in a multithreaded environment would be grabbing a common lock. (Even locking the partitions themselves, say in Ordering.arbitrary() order, wouldn't help). |
Original comment posted by finnw1 on 2011-05-22 at 12:14 PM As far as I can see the only way to get a corrupted tree (as opposed to a suboptimal one) is that two nodes should have a common ancestor but do not (e.g. because thread1 sets a.parent = b and thread2 (not seeing thread1's change) sets a.parent = c. There is no danger of spurious merges, because if two partitions are not meant to be merged then the merge method never sees pointers to both at once. All you need to ensure is that before a.merge(b) returns, a and b have a common ancestor (and that all threads agree on the fact.) Anything else is just optimization. I cannot see any way in which this constraint could be violated that would not be solved by a compareAndSet() loop. E.g. this should work:
The worst that could happen is that each thread only performs only one call to merge() and no thread sees any other thread's updates to rank (easily simulated by making rank ThreadLocal). Then you end up with a singly-linked list, but that would still give you the right answers for equals() and hashCode() and the path compression would still get a chance to reduce the depth. You might need global synchronization at a higher level, to ensure that all merge()s have completed before you start performing comparisons with equals() and hashCode(), but that does not affect the implementation of merge(). Feel free to prove me wrong. |
Original comment posted by wasserman.louis on 2011-09-20 at 06:50 PM We never got around to this. Should it be revived? |
Original comment posted by [email protected] on 2011-09-20 at 06:57 PM Oh. I had a dormant change list, forgot about it. The main problem in user code is not related to Partition - it's that it's just Too Hard (TM) to make a default value map in java (even with guava, unless that default value happens to be a collection). To take some verbosity out, basically, I meant to introduce a Partition#newPartitionMap(), returning a Map<E, Partition>, where the user wouldn't do the little "is the key in the map? if not, add this entry" dance. Kevin had proposed a Function<E, Partition> instead of default-valued map, but this wouldn't do since a common operation turned out to be accessing the keySet() of said map. It was a complicated discussion, taking more time than it was worth it, so I put it in the backburner. |
Original comment posted by [email protected] on 2011-09-21 at 01:35 AM I actually have some thoughts on this I will try to pull together soon. |
Original comment posted by [email protected] on 2011-12-10 at 04:01 PM (No comment entered for this change.) Labels: |
Original comment posted by wasserman.louis on 2012-01-17 at 03:54 AM Proposal: Introduce a Partitioning<E> class that manages the "default-valued map" details internally. If we want to support key-set iteration, we can do that, but manage it internally. Partitioning<E> supports Partition getPartition(E e), which returns an object of the Partition type we'd been playing with earlier. Additionally supports inSamePart(E, e), and all of that. For the moment, Partitioning locks the whole thing when doing any mutations. If we design a more effective concurrent version of the structure later, then that's great, but we guarantee thread-safety now (with a qualifier about performance) and try to optimize later. |
Original comment posted by wasserman.louis on 2012-01-17 at 04:34 AM For reference: Wikipedia links to http://citeseer.ist.psu.edu/anderson94waitfree.html, "Wait-free Parallel Algorithms for the Union Find Problem." It sounds, well, pretty much exactly like what we wanted? |
Original comment posted by [email protected] on 2012-05-30 at 07:43 PM (No comment entered for this change.) Labels: - |
Original comment posted by wasserman.louis on 2012-06-14 at 10:21 PM Here is a straw API to play with: lowasser/guava-experimental@e84cb02#diff-1 To summarize: UnionFindEquivalence is an Equivalence<Object> implementation that starts out equivalent to Equivalence.equals(), but allows you to merge equivalence classes. Internally, it holds a Map<Object, Partition> for objects that are in equivalence classes of size > 1. No, it doesn't let you get the objects back out for the moment, but I've written a couple of algorithms using disjoint-set data structures and it's satisfied my needs there. In a pinch, it could get retrofitted later for some of those operations, but this feels relatively neat and tidy, and addresses the use cases I'm aware of. |
Original comment posted by thomas.andreas.jung on 2012-06-15 at 07:07 AM Personally, I would prefer an immutable Equivalence created with a builder: |
Original comment posted by wasserman.louis on 2012-06-15 at 02:54 PM @thomas: For all of the applications I came across, such an equivalence would be of little use. Union-find data structures need to be dynamic to serve their traditional applications. |
Original comment posted by wasserman.louis on 2012-06-15 at 02:55 PM Specific examples: Kruskal's algorithm is utterly pointless without a dynamic union-find structure, and several variations on it are equally pointless. |
Original comment posted by thomas.andreas.jung on 2012-06-16 at 07:30 PM Wouldn't |
Original comment posted by wasserman.louis on 2012-06-16 at 09:50 PM Well, the Object.equals() contract specifies "It is consistent: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified." Of course, the last part is crucial: "provided no information used in equals comparisons on the objects is modified." The doc for Equivalence isn't quite so specific, although it has allusions along the same lines, e.g. for hash: "...as long as x remains unchanged according to the definition of the equivalence." Essentially, I'm suggesting that that exception apply here. It remains the case that a non-dynamic version is useless for the traditional applications of this data structure...and that it makes a lot of sense to model this as an equivalence relation rather than a Map<E, Partition> or something along those lines. |
Original comment posted by thomas.andreas.jung on 2012-06-18 at 06:46 AM Okay, I remembered a more strict version of the contract, i.e. I'd always implement equals in this way. One effect with a weaker implementation is that objects disappear from collections: UnionFindEquivalence equiv = UnionFindEquivalence.create(); You're right that an immutable version of a UnionFindEquivalence is useless (unless you find a practical persistent Union-find data structure). |
Now that Partition is getting a fair amount of usage we will look into improvements here. |
Any updates on this? @kevinb9n can you link to what a "Partition" is? |
We've been having some internal discussions on this topic recently, and with common.graph having been making a lot of progress recently, there's been more interest in sorting out graph-related things like this. My two cents:
Reference: https://en.wikipedia.org/wiki/Partition_of_a_set I don't know that this is the API that we're going to end up with, but that's my current opinion for what it should look like. |
Hi team, is there finally an implementation in guava yet? If yes what its called. |
@mozinrat Not yet, I'm afraid. I still have this issue on my plate, but I haven't had the cycles to push it yet. I'd like to get it resolved myself, as it would make it easier for us to release a couple of internal graph algorithm implementations. |
Original issue created by ogregoire on 2011-01-12 at 11:58 PM
It's a useful data structure that is not so easy to implement by itself, and that is very useful in algorithmics, such as in the graph theory, for instance.
The text was updated successfully, but these errors were encountered: