How to scale NodeJS PDF parsing: lessons learned

In my last post in this series we covered developing a somewhat modular parsing logic that we can apply to standardised PDFs. To cap off the series I’ll be taking you through my experience learning how to scale that across multiple processes, and the lessons learned therein.

First, here’s the idea of the program architecture I had in mind last time:

Overall Program Architecture

It’s simple enough to build a context for the parsing module, but I realised when considering concurrency that there’s no element of parallelism in that diagram.

So now, let’s frame it in a way that helps us think about concurrency.

Dead Simple Concurrency

Trivial, I know, and arguably far too textbook-general for us to use directly, but hey, it’s a fundamental concept worth formalising.

Now, first and foremost, we need to think about how we’re going to handle the input and output of our program, which will essentially wrap the parsing logic and distribute it amongst parser worker processes. There are many questions we can ask here, and many solutions:

  • Is it going to be a command-line application?

  • Is it going to be a consistent server, with a set of API endpoints? This has its own host of questions - REST or GraphQL, for example?

  • Maybe it’s just a skeleton module in a broader codebase - for example, what if we generalised our parsing across a suite of binary documents and wanted to separate the concurrency model from the particular source file type and parsing implementation?

For simplicity’s sake I’m going to wrap the parsing logic in a command-line utility. This means it’s time to make a bunch of assumptions:

  • Does it expect file paths as input, and are they relative or absolute?

  • Or does it instead expect concatenated PDF data to be piped in?

  • Is it going to output data to a file? Because if it is, then we’re going to have to provide that option as an argument for the user to specify…

Handling Command Line Input

Again, keeping things as simple as possible: I’ve opted for the program to expect a list of file paths, either as individual command line arguments:

node index file-1.pdf file-2.pdf … file-n.pdf

Or piped to standard input as a newline-separated list of file paths:

# read lines from a text file with all our paths
cat files-to-parse.txt | node index
# or perhaps just list them from a directory
find ./data -name "*.pdf" | node index

This lets the Node process manipulate the order of those paths in any way it sees fit, which is what allows us to scale the processing code later. To do this, we’re going to read the list of file paths, whichever way they were provided, and divvy them up into some arbitrary number of sub-lists. Here’s the code, the getTerminalInput function in ./input/index.js:

function getTerminalInput (subArrays) {

  return new Promise((resolve, reject) => {

    const output = [];

    if (process.stdin.isTTY) {

      // File paths passed as command-line arguments
      const input = process.argv.slice(2);

      // Aim for at most subArrays sub-lists of roughly equal length
      const len = Math.min(input.length, Math.ceil(input.length / subArrays));

      while (input.length) {
        output.push(input.splice(0, len));
      }

      resolve(output);

    } else {

      // File paths piped in as a newline-separated list on standard input
      let input = '';
      process.stdin.setEncoding('utf-8');

      process.stdin.on('readable', () => {
        let chunk;
        while ((chunk = process.stdin.read()) !== null) {
          input += chunk;
        }
      });

      process.stdin.on('end', () => {
        input = input.trim().split('\n');

        const len = Math.min(input.length, Math.ceil(input.length / subArrays));

        while (input.length) {
          output.push(input.splice(0, len));
        }

        resolve(output);
      });

    }

  });

}

module.exports = { getTerminalInput };

Why divvy up the list? Let’s say that you have an 8-core CPU on consumer grade hardware, and 500 PDFs to parse.

Unfortunately for Node, even though it handles asynchronous code fantastically thanks to its event loop, it runs your JavaScript on a single thread. To process those 500 PDFs, if you’re only running a single process, you’re only using an eighth of your processing capacity. Assuming that memory efficiency isn’t a problem, you could process the data up to eight times faster by taking advantage of Node’s built-in parallelism modules.

Splitting up our input into chunks allows us to do that.
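To make that concrete, here’s the chunking maths in isolation – the numbers and file names below are made up purely for illustration, but the splitting logic mirrors what getTerminalInput does:

// Hypothetical example: 500 file paths split across 8 cores.
// Math.ceil(500 / 8) = 63, so we end up with seven chunks of 63 and one of 59.
const paths = Array.from({ length: 500 }, (_, i) => `file-${i}.pdf`);
const chunkLen = Math.ceil(paths.length / 8);

const chunks = [];
while (paths.length) {
  chunks.push(paths.splice(0, chunkLen));
}

console.log(chunks.map((c) => c.length)); // [63, 63, 63, 63, 63, 63, 63, 59]

Each of those eight chunks can then be handed to its own worker process.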

As an aside, this is essentially a primitive load balancer, and clearly assumes that the workloads presented by parsing each PDF are interchangeable. That is, that the PDFs are the same size and hold the same structure.

This is obviously a trivial case, especially since we’re not taking into account error handling in worker processes, or which worker is currently available to handle new loads. If we had set up an API server to handle incoming parsing requests, we would have to consider those extra needs.

Clustering our code

Now that we have our input split into manageable workloads – admittedly in a contrived way; I’d love to refactor this later – let’s go over how we can cluster it. It turns out Node has two separate modules for setting up parallel code.

The one we’re going to use, the cluster module, basically allows a Node process to spawn copies of itself and balance processing between them as it sees fit.

This is built on top of the child_process module, which is less tightly coupled with parallelising Node programs themselves, and allows you to spawn other processes, like shell programs or another executable binary, and interface with them using standard input, output, et cetera.
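To show the difference in flavour, here’s a minimal child_process sketch – it’s not part of this project, and it assumes a Unix-like system with ls on the path and a ./data directory to list – purely to illustrate the “spawn another program and talk to it over stdio” style of that module:

const { spawn } = require('child_process');

// Spawn an external program and forward its output to our own stdio
const ls = spawn('ls', ['-lh', './data']);

ls.stdout.on('data', (data) => process.stdout.write(data));
ls.stderr.on('data', (data) => process.stderr.write(data));
ls.on('close', (code) => console.log(`ls exited with code ${code}`));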

I highly recommend reading through the API docs for each module, since they’re fantastically written. Even if you’re like me and find aimless manual-reading boring and total busy-work, at least familiarising yourself with the introduction to each module will help ground you in the topic and expand your knowledge of the Node ecosystem.

So let’s walk through the code. Here it is in bulk:

const cluster = require('cluster');
const numCPUs = require('os').cpus().length;

const { getTerminalInput } = require('./input');

(async function main () {

  if (cluster.isMaster) {

    const workerData = await getTerminalInput(numCPUs);

    for (let i = 0; i < workerData.length; i++) {

      const worker = cluster.fork();
      const params = { filenames: workerData[i] };

      worker.send(params);

    }

  } else {

    require('./worker');

  }

})();

So our dependencies are pretty simple. First there’s the cluster module as described above. Second, we’re requiring the os module for the express purpose of figuring out how many CPU cores there are on our machine – a fundamental parameter for splitting up our workload. Finally, there’s our input-handling function, which I’ve moved out to its own file for tidiness’ sake.

Now the main method is actually rather simple. In fact we could break it down into steps:

  1. If we’re the main process, split up the input sent to us evenly per the number of CPU cores for this machine

  2. For each worker-to-be’s load, spawn a worker with cluster.fork, set up an object which we can send to it over the cluster module’s inter-process message channel, and send the damn thing to it.

  3. If we’re not in fact the main module, then we must be a worker – just run the code in our worker file and call it a day.

Nothing crazy is going on here, and it allows us to focus on the real lifting, which is figuring out how the worker is going to use the list of filenames we give to it.
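One thing the snippet above doesn’t do is keep an eye on its workers after sending them their workloads. As a purely hypothetical extension – this handler isn’t part of the project code – the cluster module’s exit event is where the primary process would hook in if it wanted to notice a worker finishing or crashing:

// In the primary process: log whenever a worker exits, cleanly or otherwise
cluster.on('exit', (worker, code, signal) => {
  console.log(`worker ${worker.process.pid} exited (code ${code}, signal ${signal})`);
});

For this proof of concept the workers simply exit themselves when they’re done, so we get away without it.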

Messaging, Async, and Streams, all the elements of a nutritious diet

First, as above, let me dump the code for you to refer to. Trust me, looking through it first will let you skip any explanation you’d consider trivial.

const Bufferer = require('../bufferer');
const Parser = require('../parser');
const { createReadStream } = require('fs');

process.on('message', async (options) => {

  const { filenames } = options;
  const parser = new Parser();

  const parseAndLog = async (buf) => console.log(await parser.parse(buf) + ',');

  const parsingQueue = filenames.reduce(async (result, filename) => {

    await result;

    return new Promise((resolve, reject) => {

      const reader = createReadStream(filename);
      const bufferer = new Bufferer({ onEnd: parseAndLog });

      reader
        .pipe(bufferer)
        .once('finish', resolve)
        .once('error', reject)
    
    });
  
  }, true);

  try {
    await parsingQueue;
    process.exit(0);
  } catch (err) {
    console.error(err);
    process.exit(1);
  }

});

Now there are some dirty hacks in here so be careful if you’re one of the uninitiated (only joking). Let’s look at what happens first:

Step one is to require all the necessary ingredients – which, mind you, are dictated by what the code itself does: a custom-rolled Writable stream I’ve endearingly termed Bufferer; a wrapper for our parsing logic from last time, also intricately named Parser; and good old reliable createReadStream from the fs module.

Now here’s where the magic happens. You’ll notice that nothing’s actually wrapped in a function. The entire worker code is just waiting for a message to come to the process – the message from its master with the work it has to do for the day. Excuse the medieval language.

We can see first of all that the handler is asynchronous. We extract the filenames from the message itself – if this were production code I’d be validating them here. Actually, hell, I’d be validating them in our input-processing code earlier. Then we instantiate our parsing object – only one for the whole process – so we can parse multiple buffers with one set of methods. One concern of mine is that it manages memory internally, and on reflection that’s a good thing to review later.

Then there’s a simple wrapper, parseAndLog, around parsing that logs the JSON-ified PDF buffer with a comma appended to it, just to make life easier when concatenating the results of parsing multiple PDFs.


Your worker, primed and ready for a date with destiny.

Finally the meat of the matter, the asynchronous queue. Let me explain:

This worker’s received its list of filenames. For each filename (or path, really), we need to open a readable stream through the filesystem so we can get the PDF data. Then, we need to spawn our Bufferer (our waiter, following along from the restaurant analogy earlier), so we can transport the data to our Parser.

The Bufferer is custom-rolled. All it really does is accept a function to call when it’s received all the data it needs – here we’re just asking it to parse and log that data.
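I won’t reproduce the real Bufferer here – it lives in the companion repo and may well differ from this – but as a rough sketch of the idea, and assuming the onEnd option name used in the worker code above, a Writable that buffers everything and hands it off at the end could look something like this:

const { Writable } = require('stream');

// Sketch only: collects every chunk written to it, then calls onEnd with the
// concatenated buffer just before the 'finish' event fires.
class Bufferer extends Writable {
  constructor ({ onEnd, ...options } = {}) {
    super(options);
    this.onEnd = onEnd;
    this.chunks = [];
  }

  _write (chunk, encoding, callback) {
    this.chunks.push(chunk);
    callback();
  }

  _final (callback) {
    Promise.resolve(this.onEnd(Buffer.concat(this.chunks)))
      .then(() => callback())
      .catch(callback);
  }
}

module.exports = Bufferer;

Because _final here runs before 'finish' is emitted, the worker’s resolve would only fire once parsing has actually completed.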

So, now we have all the pieces, we just pipe them together:

  1. The readable stream – the PDF file, pipes to the Bufferer

  2. The Bufferer finishes, and calls our worker-wide parseAndLog method

This entire process is wrapped in a Promise, which itself is returned to the reduce function it sits inside. When it resolves, the reduce operation continues.
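If the reduce-as-a-queue trick looks odd on its own, here’s the same pattern stripped of all the PDF machinery, with made-up task names purely for illustration:

const tasks = ['a', 'b', 'c'];

const queue = tasks.reduce(async (previous, task) => {
  await previous; // wait for the task before this one to finish
  console.log(`running ${task}`);
  return new Promise((resolve) => setTimeout(resolve, 100));
}, Promise.resolve());

queue.then(() => console.log('all done'));

(The worker code seeds its reduce with true rather than Promise.resolve(); awaiting a non-promise value works just the same.)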

This asynchronous queue is actually a really useful pattern, so I’ll cover it in more detail in my next post, which will probably be more bite-sized than the last few.

Anyway, the rest of the code just ends the process according to whether the parsing queue resolved or threw. Again, if this were production code, you can bet there’d be more robust logging and error handling here, but as a proof of concept this seems alright.

So it works, but is it useful?

So there you have it. It was a bit of a journey, and it certainly works, but like any code it’s important to review what its strengths and weaknesses are. Off the top of my head:

  • Streams have to be piled up in buffers, which unfortunately defeats the purpose of using streams, and memory efficiency suffers accordingly. This is a necessary duct-tape fix to work with the pdfreader module. I’d love to see if there’s a way to stream PDF data and parse it on a finer-grained level, especially if modular, functional parsing logic can still be applied to it.

  • In this baby stage the parsing logic is also annoyingly brittle. Just think, what if I have a document that’s longer than a page? A bunch of assumptions fly out the window, and make the need for streaming PDF data even stronger.

If you’ve got any specific criticisms or concerns I’d love to hear them too, since spotting weaknesses in code is the first step to fixing them. And, if you’re aware of any better method to streaming and parsing PDFs concurrently, let me know so I can leave it here for anyone reading through this post for an answer. Either way – or for any other purpose – send me an email or get in touch on Reddit. Thanks for reading as well, I hope you learned as much as I did.

References

[1] The previous (second) blog post in this series
[2] This project’s companion repo: /input/index.js
[3] The NodeJS API Cluster module docs
[4] The NodeJS API Child Process module docs
[5] The NodeJS API OS module docs
[6] The NodeJS API File System module docs
[7] Email me!
[8] Get in touch on Reddit

