-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDFDocument readable stream not respecting highWatermark on _write (push) #269
Comments
Here's a link talking about the behavior of push in the stream documentation: http://nodejs.org/api/stream.html#stream_readable_push_chunk_encoding
|
It turns out that the utilization of the zlib.createDeflate stream in PDFKit might suffer the same flaw and that is the reason why compression is slow. See this related issue: |
Please see this issue: |
Hmm, well it would be pretty difficult to change how PDFKit does this. It would require changing the entire API to be asynchronous, and that's not really something I want to do. The node API is supposed to buffer content while the underlying stream is not ready. I see the open node issue about this that you linked to. It would be nice to get that fixed, but in the mean time if there is a patch we can apply to PDFKit directly that works around the issue without changing anything too big I'd be happy to take it. Maybe instead of creating maybe
What do you think of this idea? |
I didn't open the issue but if this is still under consideration I'm very much for it. I've been using PDFKit for batch PDF generation (which involves getting files from S3 and inlining them into the PDFs) and this library has been a godsend, but one issue I've run into is making sure the entire buffer is written before closing the streams and zipping up. The issue touched on in your second point is what has caused me to check |
@devongovett I didn't realize you responded to this thread suggesting creating a larger buffer and flushing it less frequently to the underlying stream. I think this is a good idea.. I haven't had a chance to revisit this problem in many months... I'm sorta hoping the streams performance problem goes away with node 0.12 but your suggestion sounds totally sane. When i'm back dealing with pdf generation I will look into this suggestion... thank you! EDIT: looking back at my notes i think i ended up observing problems even if i used something like this: https://github.com/ashelley/writable-filestream ... I remember it helping quite a bit but mostly just concealing the issue, crossing fingers that streams in 0.12 fix this issue |
I think you should be able to wait for stream.on('finish') in your code to be able to figure out whether the buffer has been flushed or not but feel free to read on if you want to tame the behavior of what happens after calling document.end(); Okay, so I was looking at this problem again now that i'm up and running on 0.12.2 stable. I know the following doesn't specifically fix the problem with pushing onto the destination stream but it allows you to solve the main performance problem I am seeing with PDFKit. This method can be used to externally manage the performance problem without breaking PDFKit's api. Note that this approach doesn't fix the document._write path but I believe the performance problem with PDFKit is almost completely mitigated using the following strategy of controlling how we write to the deflate stream. This solution also may be more more related to this issue I documented here: I am keeping the discussion going on this thread since we are talking about here coming up with a solution to the problem and I believe, at least in my testing, that this is the biggest performance problem. I have come up with a non-intrusive patch that does not rely on affecting the PDFKit api. What the patch does is broadcast to the event listenr all of the deflate streams (when you are using compression) that have hit their highWaterMark. To demonstrate how this helps I've come up with a proof-of-concept code here: https://gist.github.com/ashelley/a32b87c04e43608fe4e1 Unfortunately, to make the fix work you need to be aware of the issue and deal with it by structuring your rendering loop to be re entrant. Once you are able to re-enter your rendering loop you are able to take advantage of "pausing your writes" until the deflate stream can drain. https://gist.github.com/ashelley/a32b87c04e43608fe4e1#file-pdfkit-perf-test-js-L66-L82 One thing to note is that the deflate can 'finish' without 'drain' ever being called. The above code attempts to deal with this situation by keeping track of when 'drain' has been called on a particular deflate stream instance and also when 'finish' has been called. There might be more nuances to this but I found that the above solution was working for me. I've put several options in the test script so that you can use to control the various results as the problem is not apparent until your PDF document reaches a certain size. To show that the problem isn't specific to having the destination piped to the filesystem I've included a Nullstream that simply throws the data out. Uncomment one of these two lines in the test script to switch between outputting to the filesystem and Nullstream so that you may view the output pdf after it has been generated: https://gist.github.com/ashelley/a32b87c04e43608fe4e1#file-pdfkit-perf-test-js-L46-L47 Other settings include maxPages and maxLines (to control the size of the output document) https://gist.github.com/ashelley/a32b87c04e43608fe4e1#file-pdfkit-perf-test-js-L12-L13 And the most important option, allowing the draining to occur: https://gist.github.com/ashelley/a32b87c04e43608fe4e1#file-pdfkit-perf-test-js-L30 Here are the results, as you will see that as you increase the size of the pdf the total performance increase/decrease becomes evident.
In otherwords, when running the tight loop that doesn't let deflate drain be sure to run the test multiple times because the total time (flush time) can vary greatly. Also note that i found when using "shouldLetDrain" the performance was more consistent.
Going forward, I want to attempt to try to fix the problem in pdfkit as I think it might be able to be done because of how when a "PdfReference" is ending it calls deflate.end: https://github.com/devongovett/pdfkit/blob/master/lib/reference.coffee#L45 Which defers to calling .finalize when the stream ends: https://github.com/devongovett/pdfkit/blob/master/lib/reference.coffee#L24 Based on this it might be possible to fix this code to inherently be nicer to the zlib stream due to the fact that it does the deferred continuation. I've made attempts to bring back the code from the older versions of pdfkit that "do this differently" as noted here: But I haven't come up with anything that works yet. Not sure how much longer I can work on this project but I just wanted to share some more information that I had and confirm that the problem wasn't fixed in node 0.12.2. Also I would like to note that I think the compression stream is the biggest problem rather than not respecting the return value of .push. As noted by watching the _writableState.needDrain on your own destination stream you can control your own flow where as with the built in deflate stream in pdfkit it's not easily externally controllable without exposing the each deflate stream via the 'streamNeedsDrain' event as my patch does. My conclusion is that without using a method to control the behavior of the deflate stream my performance is super sporadic when dealing with larger files, but this needs more testing. |
Oops, maybe flag @devongovett in case they are interested |
After doing some more testing I can confirm that the patch + workaround does indeed provide positive results. To test using an SSD i fired up a digital ocean 1GB computer and immediately ran into issues with OOM. To test use the provided script and set to " shouldLetDrain" false and constrict the v8 heap size to 1GB I also set maxLines to 70 and maxPages to 1000.
|
@devongovett do you plan to implement a fix for this performance problem ? We also encounter this performance problem to generate PDF files from our NodeJS server. |
Also hitting this. Compression speed is pathologically slow: it takes minutes to process a pdf of a couple meg. It's much faster to send on the wire uncompressed. |
Thanks for sharing this level of detail in the issue discussion. This helped my team to realize that if we provided our source images in JPEG instead of PNG, we could bypass the zlib / deflate performance issues, because PDFKit's JPEG code does not use zlib. If your solution affords you a choice in your source image format, this can be a reason to make the switch! We were able to reduce an 18 page pdf generation from around 14 seconds to < 5 seconds. RAM usage was a lot more predictable as well. Our users will be much happier. :) |
Hello,
One of the issues I'm having with pdfkit is that it doesn't respect the highwatermark of the underlying stream.
https://github.com/devongovett/pdfkit/blob/master/lib/document.coffee#L113
(in javascript)
To fix this problem the code would have to look more like this:
And callers of it need to look like this:
Unfortunately making document._write asynchronous requires a substantial patch to pdfkit to use callbacks because right now it relies on a lot of synchronous code.
One benefit of doing this fix is that you would end up getting the "document.end callback" as mentioned here:
#265
The reason I am having this problem is documented here:
nodejs/node-v0.x-archive#8048
It was hard to find documentation on this problem but i stumbled across this link showing that .push does indeed respect the highwatermark.
nodejs/node-v0.x-archive#4814
Also related to:
#258 (comment)
#264 (possibly)
The text was updated successfully, but these errors were encountered: