Hi! First off, I think paco is a very nice library and I would like to help improve it. That said, I have a particular problem: I need to download millions of images as fast as possible. I looked into these resources:
I like the API of `paco.each`, but when testing it my computer froze as its memory blew up while trying to create 1 million coroutines. The main problem is in these lines of code. I observe the following:
1. It creates all the objects in memory before starting their tasks.
2. It also assumes that the collection fits in memory.
3. It also assumes that the collection is fast to iterate over.
4. It preserves order (nice to have).
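For context, the blow-up comes from the eager pattern sketched below. This is a hedged illustration of the general gather-everything approach, not paco's actual source; `fetch` is a hypothetical stand-in for downloading one image:

```python
import asyncio

async def fetch(i):
    # hypothetical stand-in for downloading one image
    await asyncio.sleep(0)
    return i

async def eager_each(coro, iterable):
    # Every coroutine is instantiated and scheduled up front, so the
    # whole collection (plus one task per element) must fit in memory
    # before a single item finishes.
    tasks = [asyncio.ensure_future(coro(x)) for x in iterable]
    return await asyncio.gather(*tasks)

results = asyncio.run(eager_each(fetch, range(10)))
print(results)
```

With 10 elements this is harmless; with 1 million it allocates 1 million task objects before any work completes.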
Since my problem is about speed and memory, points 1 to 3 are the most relevant. I recreated `map` and `each` using `asyncio.Queue`, limiting the number of tasks that exist at the same time. This involved creating a structure I called `Stream` that just holds a coroutine and a `Queue`. My API enforces the `limit` on `each` so it never surpasses that number of objects in memory.
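The bounded `each` can be sketched roughly like this. The parameter names (`limit`, `queue_maxsize`) follow the description above, but the body is my assumption about the shape, not the actual code at the end:

```python
import asyncio

async def each(coro, iterable, limit=3, queue_maxsize=10):
    # A bounded queue provides back-pressure: the producer blocks when
    # the queue is full, and only `limit` worker tasks run at once, so
    # at most limit + queue_maxsize items are in flight.
    queue = asyncio.Queue(maxsize=queue_maxsize)
    DONE = object()  # sentinel marking end of input

    async def producer():
        for item in iterable:
            await queue.put(item)  # blocks while the queue is full
        for _ in range(limit):
            await queue.put(DONE)  # one sentinel per worker

    async def worker():
        while True:
            item = await queue.get()
            if item is DONE:
                return
            await coro(item)

    await asyncio.gather(producer(), *(worker() for _ in range(limit)))

seen = []

async def record(x):
    # hypothetical stand-in download coroutine
    await asyncio.sleep(0)
    seen.append(x)

asyncio.run(each(record, range(20), limit=3, queue_maxsize=5))
print(sorted(seen))
```

Note that with concurrent workers completion order is not guaranteed, which is why point 4 above is only a nice-to-have here.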
Both the new `from_iterable` and `map` functions have a `queue_maxsize` parameter that further limits how the data flows and enforces a back-pressure mechanism. The code is at the end. I wanted to share the experiment and also open the possibility of creating a `paco.stream` module to continue the life of this code.