Optional cache in tasks? #541

Open
montesmariana opened this issue May 10, 2022 · 5 comments
Labels
question Further information is requested

Comments

@montesmariana
Contributor

Is it possible to deactivate the cache in specific tasks within a workflow?

Context
I'm running a workflow with multiple steps and a long sequence of inputs that take different values. The splitter function is great here; without pydra, this can turn into very ugly and complicated nested loops. However, for some tasks it is preferable to re-do the computation in different workflow runs rather than taking up storage with the output. I haven't found anything in the code that allows the user to turn off the cache storage for tasks within a workflow. Is it not supported, or did I just miss it?

I did think of merging the tasks into larger tasks that only output what I need to store. But that also means the merged task will run multiple times within each run of the workflow, for all the input combinations, which implies unnecessary computation.

Example

  • Function A creates a list of words.
  • Function B filters the list of words according to different criteria (you get two or three different subsets).
  • Function C creates a large, heavy, computationally expensive table with the words from function A.
  • Function D subsets the table created by C, using a filtered list from function B and other parameters.
  • Function E performs computations on D based on different instructions.

The relationship between the speed of computation and the size of the output in functions B and D makes it more convenient to rerun them when needed than to use storage for their cache. But if they are merged into one function, function B will be rerun more times than necessary. (A rough sketch of this layout follows below.)
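For concreteness, a minimal sketch of this layout using pydra's decorator and Workflow API roughly as it looked at the time of this issue; all function names, bodies, and input values are placeholders, and the split call is only meant to illustrate fanning out over input combinations instead of nested loops:

```python
import pydra

# Placeholder tasks standing in for functions A-E; bodies are illustrative only.
@pydra.mark.task
def make_words():  # A: create a list of words
    return ["alpha", "beta", "gamma", "delta"]

@pydra.mark.task
def filter_words(words, criterion):  # B: cheap to recompute, not worth caching
    return [w for w in words if criterion in w]

@pydra.mark.task
def build_table(words):  # C: large, expensive table - worth caching
    return {w: len(w) for w in words}

@pydra.mark.task
def subset_table(table, words):  # D: cheap subset of C, not worth caching
    return {w: v for w, v in table.items() if w in words}

@pydra.mark.task
def analyze(subset, instruction):  # E: computations on D
    return (instruction, sum(subset.values()))

wf = pydra.Workflow(
    name="wf",
    input_spec=["criterion", "instruction"],
    criterion=["a", "e"],
    instruction=["sum", "count"],
)
wf.split(["criterion", "instruction"])  # fan out over all input combinations
wf.add(make_words(name="a"))
wf.add(filter_words(name="b", words=wf.a.lzout.out, criterion=wf.lzin.criterion))
wf.add(build_table(name="c", words=wf.a.lzout.out))
wf.add(subset_table(name="d", table=wf.c.lzout.out, words=wf.b.lzout.out))
wf.add(analyze(name="e", subset=wf.d.lzout.out, instruction=wf.lzin.instruction))
wf.set_output([("result", wf.e.lzout.out)])

with pydra.Submitter(plugin="cf") as sub:
    sub(wf)
print(wf.result())
```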


I would appreciate any advice or guidance :)
Thank you for your amazing work in this package!

@montesmariana montesmariana added the question Further information is requested label May 10, 2022
@djarecka
Collaborator

Hi @montesmariana! Thank you for your question!

If you don't want to keep the output of a specific task, you can remove it, and the task should re-run if it's used again.

But you can also pass an additional argument to your task, rerun=True; this should force the task to always rerun instead of using the cached output.
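For example, a minimal sketch assuming a function task defined with pydra.mark.task; the task, inputs, and cache path are placeholders:

```python
import pydra

@pydra.mark.task
def filter_words(words, criterion):
    return [w for w in words if criterion in w]

# Hypothetical task; the cache location is illustrative.
task = filter_words(words=["alpha", "beta"], criterion="al",
                    cache_dir="/tmp/pydra-cache")

task(rerun=True)  # force re-execution even if a cached result already exists
print(task.result())
```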

Let me know if that helps!

@montesmariana
Contributor Author

Hi, thanks for your answer!
How can I remove the output, then? Sorry if it's a stupid question, but I don't think I've seen how.
My concern is not so much about not using the cache output but about avoiding or removing its storage.

@effigies
Contributor

It sounds like the idea is basically to automatically remove the cache directory once all downstream nodes have consumed the outputs? That's not supported but seems sensible.

I think the API could be as simple as shutil.rmtree(task.cache_dir) (though we need to make sure that cache_dir reliably points to the right place) and we should be able to tag something like task.ephemeral = True.
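As a manual stopgap, something along these lines could work today. This is only a sketch that assumes task.cache_dir points at a directory that is safe to delete wholesale; the ephemeral flag is the proposal above, not an existing attribute:

```python
import shutil
import pydra

@pydra.mark.task
def filter_words(words, criterion):
    return [w for w in words if criterion in w]

task = filter_words(words=["alpha", "beta"], criterion="al",
                    cache_dir="/tmp/pydra-cache")  # illustrative path
task()

# Once downstream consumers have read the result, drop the cached output.
# Per the caveat above: make sure cache_dir points where you expect before
# removing it - this deletes everything under that directory.
shutil.rmtree(task.cache_dir, ignore_errors=True)

# task.ephemeral = True  # proposed flag from this thread; not an existing attribute
```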

@djarecka
Collaborator

Yes, that's a good idea; we should support removing the output after it has been used.

@montesmariana
Contributor Author

Great, thank you!! I look forward to that :)
