Fuel server and Blocks job resumption #157
Currently (correct) job resumption and the "server process" are an either/or. It would be nice to at least think about fixing this.

@bartvm Have you thought at all about how this could work?

Seems like we'd need to define a request/reply protocol that allows the client to request all the bits of state necessary for the streams in the pipeline (in theory, dataset objects themselves ought to be stateless...). This might not be as bad as it sounds: every transformer just needs a method like so:
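(The code snippet that originally followed here appears to have been lost in extraction; as a stand-in, a minimal sketch of what such a method could look like. The `get_state`/`set_state` names and the contents of the state dict are assumptions, not existing Fuel API.)

```python
from fuel.transformers import Transformer

class SomeTransformer(Transformer):
    """Sketch only: a transformer that can report and restore the
    state needed to resume iteration, recursing into the stream it
    wraps."""

    def get_state(self):
        # `batches_produced` is a hypothetical counter this
        # transformer would maintain; `data_stream` is the stream
        # this transformer wraps.
        return {'batches_produced': self.batches_produced,
                'child': self.data_stream.get_state()}

    def set_state(self, state):
        # Restore in the same recursive fashion before resuming.
        self.batches_produced = state['batches_produced']
        self.data_stream.set_state(state['child'])
```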
I've given it a little bit of thought, but don't really have a clear idea of how it could work. A big problem is the buffers: the current method is quite fast partly because there is no request/reply protocol, so the server can just push data until all the ZMQ and system TCP buffers are full. However, that means that when the client process requests the server's state, that state is not in sync with what the client has received up to that point.

If we limit the checkpointing/resumption to fixed intervals, e.g. only in between epochs, it's probably a lot more feasible. This might not be as bad a limitation as it sounds, because we could write a data stream that splits a long epoch up into smaller epochs, so that we can save at regular intervals (it might make things slightly more complicated if you want to e.g. average something over the full epoch, though). In that case, I wonder whether we really need custom …
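(As a rough illustration of that epoch-splitting idea, a sketch only: `EpochSplitter` is not an existing Fuel class, and a real version would also need to detect when the underlying epoch is exhausted and start a fresh one.)

```python
from itertools import islice

class EpochSplitter(object):
    """Sketch: re-chunk one long epoch into several short ones, so
    that 'save in between epochs' happens at regular intervals."""

    def __init__(self, stream, batches_per_sub_epoch):
        self.stream = stream
        self.batches_per_sub_epoch = batches_per_sub_epoch
        self._iterator = None

    def get_epoch_iterator(self):
        # Keep a single underlying iterator alive across sub-epochs
        # and hand out fixed-size slices of it.
        if self._iterator is None:
            self._iterator = self.stream.get_epoch_iterator()
        return islice(self._iterator, self.batches_per_sub_epoch)
```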
I think we can work around the buffers issue, actually, if we assume that there is a control connection for initiating epochs. Before we start an epoch on the server side, we pickle our data stream so that the pre-epoch DataStream is available upon request (we might even send it to the client once per epoch if that's not too expensive). The client keeps track of how many batches it has received in a given epoch, and records it if the job dies. Upon resumption, the pickled pre-epoch stream is sent back to the server along with the number of batches already seen; the server queues it up to the right place before starting to send any batches over the wire. We could potentially put this functionality in …

I think this gets us everything we need for arbitrary resumability, or am I missing something?
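(A minimal sketch of the server-side half of that protocol; the function name and details are illustrative, not actual Fuel code.)

```python
import pickle

def resume_epoch(pickled_stream, batches_already_seen):
    # Restore the stream exactly as it was at the start of the epoch.
    stream = pickle.loads(pickled_stream)
    iterator = stream.get_epoch_iterator()
    # Silently skip the batches the client already consumed, so that
    # nothing is sent twice over the wire.
    for _ in range(batches_already_seen):
        next(iterator)
    return iterator  # the server starts pushing batches from here
```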
Ah, clever, that sounds like it would work, yes. That said, I'm not sure what people find a higher priority: resumption, or parallelizing the data processing? It might be worth implementing the latter first, because my guess is that the code will overlap a bit.
Just seen this and my 2 cents: I'd value being able to parallelize the data processing more highly than the resumability stuff. Resumability seems to constrain the implementation a lot and is easy to break; in my case at least, implementing it for custom datasets didn't seem worth it. Without having to worry about resumability I have more freedom, e.g. to use non-picklable things like generators to write more idiomatic iteration code. I tried parallelising my data stream transforms using concurrent.futures.ProcessPoolExecutor.map, but the pickling overhead seems to kill it. If this more efficient serialization and zmq-based approach could generalise to a pool of worker processes, that would be awesome! It would also be awesome if ServerDataStream could handle forking the server process(es) for you :)
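(For reference, a minimal sketch of the ProcessPoolExecutor approach described above; the transform and training step are placeholders. The point is that every batch is serialized twice per round trip.)

```python
from concurrent.futures import ProcessPoolExecutor

def expensive_transform(batch):
    return [x * 2 for x in batch]  # stand-in for real preprocessing

def train_on(batch):
    pass  # stand-in for a training step

if __name__ == '__main__':
    batches = [list(range(1000)) for _ in range(100)]
    with ProcessPoolExecutor() as pool:
        # Each batch is pickled on its way to a worker process and
        # the result is pickled on the way back; for large arrays
        # this round trip can swamp the gain from parallelism.
        for transformed in pool.map(expensive_transform, batches):
            train_on(transformed)
```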
All of this is definitely going to be optional, but given that most of the primary developers are scientists, fully reproducible runs (even interrupted runs) are a moderately important use case.

The current implementation is mostly there so that you can "pipeline" transforms and thus hide their cost outside the training process; there hasn't been much consideration given to the multiple-preprocessing-worker case. There is a divide and conquer pattern in the ZMQ docs that could be leveraged, but the way they do it with PUSH and PULL sockets isn't terribly robust and has a few race conditions in corner cases, so we'd need to harden it a bit.
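(For concreteness, roughly what one worker in that ZMQ divide-and-conquer (ventilator/sink) pattern looks like; the ports and the `transform` callable are placeholders, and the sketch deliberately omits the startup/shutdown synchronization whose absence causes the race conditions mentioned above.)

```python
import zmq

def worker(transform, ventilator='tcp://localhost:5557',
           sink='tcp://localhost:5558'):
    # PULL raw batches from the ventilator, PUSH results to the sink.
    context = zmq.Context()
    receiver = context.socket(zmq.PULL)
    receiver.connect(ventilator)
    sender = context.socket(zmq.PUSH)
    sender.connect(sink)
    while True:
        batch = receiver.recv_pyobj()
        sender.send_pyobj(transform(batch))
```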
OK, yeah, that makes sense. I need some level of reproducibility too, but unfortunately I also need to do some relatively expensive transformations on-the-fly, which is why I'm particularly interested in parallelisation and backgrounding the transformation work. If I could precompute the transformations, then parallelisation would be less of an issue. For now I'm using ThreadPoolExecutor.map with some Cython code that gives up the GIL, which gives me a reasonable speed boost, but it's not a general solution. (Perhaps implementing a concurrent.futures.Executor based on zmq and then making the executor pluggable could be a nice approach to both parallelisation and backgrounding, if you do go down that road?)
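(A sketch of that workaround, assuming `cython_transform` is a compiled routine that releases the GIL internally, e.g. inside a Cython `nogil` block, and `batches` and `train_on` are placeholders; unlike a process pool, nothing here has to be pickled.)

```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as pool:
    # Threads give real parallelism here only because the Cython
    # kernel drops the GIL while it runs.
    for transformed in pool.map(cython_transform, batches):
        train_on(transformed)
```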