Optional cache in tasks? #541
Comments
Hi @montesmariana! Thank you for your question! If you don't want to keep the output of a specific task, you can remove it, and the task should re-run if it's used again. But you can also add an additional argument to your task. Let me know if that helps!
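To illustrate why deleting a task's stored output makes it run again, here is a minimal sketch of an input-hashed cache in plain Python. It is not pydra's implementation or API: `run_cached`, `square`, the `result.json` file name, and the `calls` counter are all hypothetical names invented for this example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

calls = {"count": 0}  # counts real executions, for illustration only


def run_cached(func, cache_root, **inputs):
    """Run func(**inputs), reusing a result stored under a hash of the inputs."""
    key = hashlib.sha256(
        json.dumps([func.__name__, inputs], sort_keys=True).encode()
    ).hexdigest()
    out_dir = Path(cache_root) / key
    result_file = out_dir / "result.json"
    if result_file.exists():
        # Cache hit: read the stored result instead of recomputing
        return json.loads(result_file.read_text()), out_dir
    out_dir.mkdir(parents=True, exist_ok=True)
    result = func(**inputs)
    result_file.write_text(json.dumps(result))
    return result, out_dir


def square(x):
    calls["count"] += 1
    return x * x


cache_root = tempfile.mkdtemp()
first, out_dir = run_cached(square, cache_root, x=4)  # computes and stores
second, _ = run_cached(square, cache_root, x=4)       # served from the cache
shutil.rmtree(out_dir)                                # drop the stored output
third, _ = run_cached(square, cache_root, x=4)        # has to compute again
```

In this sketch `square` executes twice: once for the first call and once after its output directory was removed.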
Hi, thanks for your answer!
It sounds like the idea is basically to automatically remove the cache directory once all downstream nodes have consumed the outputs? That's not supported but seems sensible. I think the API could be as simple as
Yes, that's a good idea. We should support removing the output after using it.
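The cleanup idea discussed above (remove a task's output once every downstream node has consumed it) could be sketched with simple reference counting. This is only an illustration in plain Python, not pydra code and not the API proposed in this thread: `EphemeralOutput`, `n_consumers`, and `out.txt` are hypothetical names, and the sketch assumes the number of downstream consumers is known in advance.

```python
import shutil
import tempfile
from pathlib import Path


class EphemeralOutput:
    """Output directory that is deleted once every consumer has read it."""

    def __init__(self, n_consumers):
        self.path = Path(tempfile.mkdtemp())
        self.remaining = n_consumers  # downstream nodes that still need the output

    def write(self, text):
        (self.path / "out.txt").write_text(text)

    def consume(self):
        """Read the output; remove the directory after the last consumer."""
        text = (self.path / "out.txt").read_text()
        self.remaining -= 1
        if self.remaining == 0:
            shutil.rmtree(self.path)
        return text


out = EphemeralOutput(n_consumers=2)
out.write("large intermediate result")
a = out.consume()
exists_after_first = out.path.exists()  # one consumer is still pending
b = out.consume()
exists_after_last = out.path.exists()   # the last consumer triggered cleanup
```

A real workflow engine would derive the consumer count from the graph's edges rather than take it as an argument.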
Great, thank you!! I look forward to that :)
Is it possible to deactivate the cache in specific tasks within a workflow?
Context
I'm running a workflow with multiple steps and a long sequence of inputs that take different values. The splitter function is great here; without pydra, this can turn into very ugly and complicated nested loops. However, for some tasks it is preferable to re-do the computation in different workflow runs rather than taking up storage with the output. I haven't found anything in the code that allows the user to turn off the cache storage for tasks within a workflow. Is it not supported, or did I just miss it?
I did think of merging the tasks into larger tasks that only output what I need to store. But then each merged task would run once for every input combination within each run of the workflow, which implies unnecessary computation.
Example
The relationship between speed of computation and size of the output in Functions B and D makes it more convenient to rerun them when needed than to use storage for their cache. But if they are merged into one function, Function B will be rerun more times than necessary.
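The extra reruns can be counted in a small sketch. It assumes, purely for illustration, that Function B depends only on an input `x` while Function D combines B's output with a second input `y`; the function bodies, input values, and names (`func_b`, `func_d`, `func_bd`) are hypothetical stand-ins, not the actual functions of the workflow.

```python
from itertools import product

xs = [1, 2, 3]
ys = [10, 20]
counts = {"separate": 0, "merged": 0}  # how often B actually executes


def func_b(x, mode):
    counts[mode] += 1
    return x * 2


def func_d(b_out, y):
    return b_out + y


# Separate tasks: B runs once per x, and its output is reused for every y
b_outputs = {x: func_b(x, "separate") for x in xs}
separate = [func_d(b_outputs[x], y) for x, y in product(xs, ys)]


# Merged task: B is recomputed for every (x, y) combination
def func_bd(x, y):
    return func_d(func_b(x, "merged"), y)


merged = [func_bd(x, y) for x, y in product(xs, ys)]
```

Both variants give the same results, but under these assumptions B runs `len(xs)` times when kept separate and `len(xs) * len(ys)` times when merged.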
I would appreciate any advice or guidance :)
Thank you for your amazing work in this package!