PDFDocument readable stream not respecting highWatermark on _write (push) #269

Open · ashelley opened this issue Aug 11, 2014 · 12 comments

@ashelley
Hello,

One of the issues I'm having with pdfkit is that it doesn't respect the highWaterMark of the underlying stream.

https://github.com/devongovett/pdfkit/blob/master/lib/document.coffee#L113

(in javascript)

    PDFDocument.prototype._write = function(data) {
      if (!Buffer.isBuffer(data)) {
        data = new Buffer(data + '\n', 'binary');
      }
      this.push(data); // <--- guilty line
      return this._offset += data.length;
    };

To fix this problem the code would have to look more like this:

    PDFDocument.prototype._write = function(data) {
      if (!Buffer.isBuffer(data)) {
        data = new Buffer(data + '\n', 'binary');
      }
      this._offset += data.length;
      var ok = this.push(data); 
      return ok;
    };

And callers of it need to look like this:

    if (!document._write(chunk)) {
      document.once('drain', function() {
        // continue writing chunks
      });
    }
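
Fleshed out, that caller pattern might look like the following sketch. Note that writeChunks and the chunks array are hypothetical, and it assumes _write returns push()'s boolean and that the document emits a 'drain'-style event when the consumer catches up (which the patch would also need to add):

    function writeChunks(document, chunks, done) {
      var i = 0;
      (function writeNext() {
        while (i < chunks.length) {
          if (!document._write(chunks[i++])) {
            // the readable side is over its highWaterMark;
            // resume once the consumer catches up
            document.once('drain', writeNext);
            return;
          }
        }
        done();
      })();
    }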

Unfortunately, making document._write asynchronous would require a substantial patch to pdfkit to use callbacks, because right now it relies on a lot of synchronous code.

One benefit of doing this fix is that you would end up getting the "document.end callback" as mentioned here:

#265

The reason I am having this problem is documented here:

nodejs/node-v0.x-archive#8048

It was hard to find documentation on this problem, but I stumbled across this link showing that .push does indeed respect the highWaterMark:

nodejs/node-v0.x-archive#4814

Also related to:

#258 (comment)
#264 (possibly)

@ashelley (Author)

Here's a link talking about the behavior of push in the stream documentation:

http://nodejs.org/api/stream.html#stream_readable_push_chunk_encoding

"return Boolean: Whether or not more pushes should be performed"

@ashelley (Author)

It turns out that PDFKit's use of the zlib.createDeflate stream might suffer from the same flaw, and that may be why compression is slow. See this related issue:

#264 (comment)

@ashelley (Author)

Please see this issue:

nodejs/node-v0.x-archive#6623

@devongovett (Member)

Hmm, well it would be pretty difficult to change how PDFKit does this. It would require changing the entire API to be asynchronous, and that's not really something I want to do. The node API is supposed to buffer content while the underlying stream is not ready. I see the open node issue about this that you linked to. It would be nice to get that fixed, but in the meantime, if there is a patch we can apply to PDFKit directly that works around the issue without changing anything too big, I'd be happy to take it.

Maybe instead of creating many small Buffer objects for each call to doc._write, we could create one big Buffer the first time and fill it with each piece. Then when it fills up, we can empty everything in one go to the actual stream (a rough sketch follows the list below). That would have two benefits:

  1. We'd be creating a lot fewer small buffers, which would reduce memory usage and improve performance (buffer allocations are slow).
  2. We'd avoid writing to the stream as quickly, which might give the underlying file (or whatever the destination is) more time to keep up, and node would have to do less buffering.
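
A rough, untested sketch of what that could look like (assuming the constructor allocates this._buffer = new Buffer(65536) and sets this._bufferUsed = 0, and that finalization flushes the pool before ending the stream):

    PDFDocument.prototype._flushBuffer = function() {
      if (this._bufferUsed > 0) {
        // hand the filled region to the stream, then start a fresh pool,
        // because the pushed slice shares memory with the old buffer
        this.push(this._buffer.slice(0, this._bufferUsed));
        this._buffer = new Buffer(this._buffer.length);
        this._bufferUsed = 0;
      }
    };

    PDFDocument.prototype._write = function(data) {
      if (!Buffer.isBuffer(data)) {
        data = new Buffer(data + '\n', 'binary');
      }
      if (this._bufferUsed + data.length > this._buffer.length) {
        this._flushBuffer();
      }
      if (data.length >= this._buffer.length) {
        this.push(data); // chunk bigger than the pool: push it through directly
      } else {
        data.copy(this._buffer, this._bufferUsed);
        this._bufferUsed += data.length;
      }
      return this._offset += data.length;
    };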

What do you think of this idea?

@tomauty commented Nov 24, 2014

I didn't open the issue, but if this is still under consideration I'm very much for it. I've been using PDFKit for batch PDF generation (which involves getting files from S3 and inlining them into the PDFs) and this library has been a godsend, but one issue I've run into is making sure the entire buffer is written before closing the streams and zipping up. The issue touched on in your second point is what has caused me to check _writableState on a setInterval, which ain't pretty :)

@ashelley (Author)

@devongovett I didn't realize you responded to this thread suggesting creating a larger buffer and flushing it to the underlying stream less frequently. I think this is a good idea. I haven't had a chance to revisit this problem in many months, and I'm sort of hoping the streams performance problem goes away with node 0.12, but your suggestion sounds totally sane. When I'm back dealing with PDF generation I will look into it. Thank you!

EDIT: looking back at my notes, I think I ended up observing problems even when I used something like this: https://github.com/ashelley/writable-filestream. I remember it helping quite a bit, but mostly just concealing the issue. Crossing fingers that streams in 0.12 fix this.

@ashelley (Author)

@tomauty

I think you should be able to wait for stream.on('finish') in your code to figure out whether the buffer has been flushed, but read on if you want to tame what happens after calling document.end().
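
For example, assuming the document is piped to a file (the writable side's 'finish' event is standard node streams API):

    var fs = require('fs');

    var out = fs.createWriteStream('output.pdf');
    doc.pipe(out);
    doc.end();
    out.on('finish', function() {
      // the destination has flushed everything; safe to close or zip now
    });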

Okay, so I was looking at this problem again now that I'm up and running on 0.12.2 stable. The following doesn't fix the problem of pushing onto the destination stream (the document._write path), but it can be used to manage the main performance problem I'm seeing with PDFKit externally, without breaking PDFKit's API. I believe the performance problem is almost completely mitigated using the following strategy of controlling how we write to the deflate stream.

This solution may actually be more related to this issue I documented here:

#264

I'm keeping the discussion going on this thread since we're talking about coming up with a solution here, and I believe, at least in my testing, that this is the biggest performance problem.

I have come up with a non-intrusive patch that does not affect the PDFKit API:

ashelley@82bdfb6

What the patch does is emit an event ('streamNeedsDrain') for each deflate stream (when you are using compression) that has hit its highWaterMark.

To demonstrate how this helps, I've put together a proof of concept here:

https://gist.github.com/ashelley/a32b87c04e43608fe4e1

Unfortunately, to make the fix work you need to be aware of the issue and deal with it by structuring your rendering loop to be re-entrant. Once you are able to re-enter your rendering loop, you can take advantage of pausing your writes until the deflate stream can drain.

https://gist.github.com/ashelley/a32b87c04e43608fe4e1#file-pdfkit-perf-test-js-L66-L82

One thing to note is that the deflate can 'finish' without 'drain' ever being called. The above code attempts to deal with this situation by keeping track of when 'drain' has been called on a particular deflate stream instance and also when 'finish' has been called. There might be more nuances to this but I found that the above solution was working for me.
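
Condensed, that bookkeeping looks something like the sketch below. It assumes the 'streamNeedsDrain' event added by my patch; drawPage, page, and maxPages are placeholders for your own rendering loop:

    var waiting = 0;
    var page = 0;

    doc.on('streamNeedsDrain', function(deflate) {
      waiting++;
      var resume = function() {
        // either event means it is safe to keep writing, so remove both
        deflate.removeListener('drain', resume);
        deflate.removeListener('finish', resume);
        if (--waiting === 0) renderNextPage(); // re-enter the rendering loop
      };
      // 'finish' can arrive without 'drain' ever firing, so listen for both
      deflate.once('drain', resume);
      deflate.once('finish', resume);
    });

    function renderNextPage() {
      while (page < maxPages && waiting === 0) {
        drawPage(doc, page++); // hypothetical: draws one page's content
      }
      if (page >= maxPages && waiting === 0) doc.end();
    }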

I've put several options in the test script that you can use to control the various scenarios, as the problem is not apparent until your PDF document reaches a certain size. To show that the problem isn't specific to having the destination piped to the filesystem, I've included a NullStream that simply throws the data away.
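
A NullStream like that takes only a few lines (a sketch of the idea, not the gist's exact code):

    var Writable = require('stream').Writable;
    var util = require('util');

    function NullStream() {
      Writable.call(this);
    }
    util.inherits(NullStream, Writable);

    NullStream.prototype._write = function(chunk, encoding, callback) {
      // discard the data; we only care about generation speed
      callback();
    };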

Uncomment one of these two lines in the test script to switch between outputting to the filesystem and the NullStream, so that you can view the output PDF after it has been generated:

https://gist.github.com/ashelley/a32b87c04e43608fe4e1#file-pdfkit-perf-test-js-L46-L47

Other settings include maxPages and maxLines (to control the size of the output document)

https://gist.github.com/ashelley/a32b87c04e43608fe4e1#file-pdfkit-perf-test-js-L12-L13

And the most important option, allowing the draining to occur:

https://gist.github.com/ashelley/a32b87c04e43608fe4e1#file-pdfkit-perf-test-js-L30

Here are the results. As you increase the size of the PDF, the performance difference becomes more evident.

  • Notice that with a smaller file size the performance improvements aren't as noticeable:

    NODE_VERSION = v0.12.2 shouldLetDrain = true maxLines = 50 maxPages = 500 using nullStream = true
    render time = 30.14 flush end = 0.03 total time = 30.17
    NODE_VERSION = v0.12.2 shouldLetDrain = false maxLines = 50 maxPages = 500 using nullStream = true
    render time = 14.02 flush end = 20.35 total time = 34.38
    NODE_VERSION = v0.12.2 shouldLetDrain = true maxLines = 70 maxPages = 500 using nullStream = true
    render time = 44.23 flush end = 0.02 total time = 44.25
    NODE_VERSION = v0.12.2 shouldLetDrain = false maxLines = 70 maxPages = 500 using nullStream = true
    render time = 19.71 flush end = 29.33 total time = 49.04
  • If you increase the file size, the results really are quite dramatic:

    NODE_VERSION = v0.12.2 shouldLetDrain = true maxLines = 70 maxPages = 1000 using nullStream = true
    render time = 88.87 flush end = 0.04 total time = 88.91
    NODE_VERSION = v0.12.2 shouldLetDrain = false maxLines = 70 maxPages = 1000 using nullStream = true
    render time = 43.86 flush end = 71.19 total time = 115.04
  • The above shows the difference with the NullStream. Please note that on my system I found performance was more sporadic when not using shouldLetDrain; I've pasted a roughly median number, but I really had to run it a bunch of times to get a non-exaggerated figure.

In other words, when running the tight loop that doesn't let deflate drain, be sure to run the test multiple times, because the total time (flush time) can vary greatly. Also note that I found performance was more consistent when using shouldLetDrain.

  • When running with an actual filestream, I found that not only did the results without shouldLetDrain vary wildly, but the performance difference was quite noticeable. It is also worth noting that with shouldLetDrain the performance is acceptable: based on my experience with PDF generation on other platforms, ~100s on my system is a totally acceptable number.

    NODE_VERSION = v0.12.2 shouldLetDrain = true maxLines = 70 maxPages = 1000 using nullStream = false
    render time = 100.48 flush end = 0.15 total time = 100.63
    NODE_VERSION = v0.12.2 shouldLetDrain = false maxLines = 70 maxPages = 1000 using nullStream = false
    render time = 44.69 flush end = 341.98 total time = 386.66
  • Please note that I'm not using an SSD here; some testing on other systems would be required to confirm the benefits. I'm going to try this after I post, to see how different systems behave.

Going forward, I want to attempt to fix the problem in pdfkit itself. I think it might be doable because of how a PDFReference calls deflate.end() when it is ending:

https://github.com/devongovett/pdfkit/blob/master/lib/reference.coffee#L45

This defers calling .finalize until the stream ends:

https://github.com/devongovett/pdfkit/blob/master/lib/reference.coffee#L24

Based on this, it might be possible to fix this code to be inherently nicer to the zlib stream, since it already does a deferred continuation. I've made attempts to bring back the code from older versions of pdfkit that "do this differently", as noted here:

#264

https://github.com/devongovett/pdfkit/blob/933aee7d6062c4f5c5a147c15ca3d999f224f0b9/lib/reference.coffee#L41-46

But I haven't come up with anything that works yet. I'm not sure how much longer I can work on this, but I wanted to share some more information and confirm that the problem wasn't fixed in node 0.12.2.

Also, I would like to note that I think the compression stream is the bigger problem, rather than the return value of .push not being respected. By watching _writableState.needDrain on your own destination stream you can control your own flow, whereas the built-in deflate streams in pdfkit aren't easily externally controllable without exposing each deflate stream via the 'streamNeedsDrain' event, as my patch does.
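
For reference, the destination-stream half of that looks something like this (it peeks at a private field, so it is a workaround rather than a supported API):

    var fs = require('fs');

    var out = fs.createWriteStream('out.pdf');
    doc.pipe(out);

    function destinationNeedsDrain() {
      // _writableState is internal to node; checking needDrain tells you
      // whether the destination's buffer is over its highWaterMark
      return out._writableState.needDrain;
    }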

My conclusion is that without some way to control the behavior of the deflate stream, my performance is very sporadic when dealing with larger files, but this needs more testing.

@ashelley (Author)

Oops, maybe flag @devongovett in case they are interested

@ashelley (Author)

After doing some more testing I can confirm that the patch + workaround does indeed provide positive results. To test on an SSD, I fired up a DigitalOcean 1GB instance and immediately ran into OOM issues. To reproduce, use the provided script, set shouldLetDrain to false, and restrict the v8 heap size to 1GB.

I also set maxLines to 70 and maxPages to 1000.

  • Without the workaround, even with 1GB of heap, the process will not complete:

    node --max_old_space_size=1024 perf-test.js
    FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
    Aborted (core dumped)

  • With the workaround, you can run the 1000-page test in as little as 20MB of memory and still have it complete without thrashing the GC:

    node --max_old_space_size=20 perf-test.js
    calling document.end
    NODE_VERSION = v0.12.2 shouldLetDrain = true maxLines = 70 maxPages = 1000 using nullStream = false
    render time = 159.61 flush end = 0.12 total time = 159.73

@mgendre commented Jan 20, 2016

@devongovett do you plan to implement a fix for this performance problem?

We also hit this performance problem when generating PDF files from our Node.js server.

@acthp commented Mar 15, 2016

Also hitting this. Compression speed is pathologically slow: it takes minutes to process a PDF of a couple of megabytes. It's much faster to send it over the wire uncompressed.

@jeff-r-koyaltech commented Dec 22, 2023

Thanks for sharing this level of detail in the issue discussion. This helped my team to realize that if we provided our source images in JPEG instead of PNG, we could bypass the zlib / deflate performance issues, because PDFKit's JPEG code does not use zlib.
https://github.com/foliojs/pdfkit/blob/master/lib/image/jpeg.js
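
For example, embedding a JPEG with PDFKit's standard image API (the path and placement values here are just illustrative):

    doc.image('photo.jpg', 50, 50, { width: 300 });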

If your situation affords you a choice of source image format, this can be a reason to make the switch! We were able to reduce an 18-page PDF generation from around 14 seconds to under 5 seconds. RAM usage was a lot more predictable as well. Our users will be much happier. :)
