How to parse hundreds of PDFs in a blink with NodeJS Streams

Take a step into program architecture and learn how to design a practical solution to a real business problem using NodeJS Streams.


1. A Detour: Fluid Mechanics
2. The Situation
3. The Project
  a. Basic Architecture: Constraints
  b. Basic Architecture: Solutions
4. To Be Continued…

A Detour: Fluid Mechanics

One of the greatest strengths of software is that we can develop abstractions which let us reason about code, and manipulate data, in ways we can understand. Streams are one such class of abstraction.

In simple fluid mechanics, the concept of a streamline is useful for reasoning about the way fluid particles will move, and the constraints applied to them at various points in a system.

For example, say you’ve got some water flowing uniformly through a pipe. Halfway down, the pipe branches in two. If the branches are identical, the flow will split evenly between them. Engineers use the abstract concept of a streamline to reason about the water’s properties, such as its flow rate, for any number of branches or complex pipeline configurations. If you asked an engineer what flow rate they’d assume through each branch, they’d intuitively, and rightly, reply “one half”. The same reasoning extends mathematically to an arbitrary number of streamlines.

Streams, conceptually, are to code what streamlines are to fluid mechanics. We can reason about data at any given point by considering it as part of a flow, rather than worrying about the implementation details of how it’s stored. Arguably you could generalise this to some universal concept of a pipeline that we can use between disciplines - a sales funnel comes to mind - but that’s tangential and we’ll cover it later. The best example of streams, and one you absolutely must familiarise yourself with if you haven’t already, is UNIX pipes:

cat server.log | grep 400 | less

We affectionately call the | character a pipe, based on its function: we’re piping the output of one program as the input of another program, effectively setting up a pipeline.

(Also, it looks like a pipe.)

If you’re like me and wonder at this point why this is necessary, ask yourself why we use pipelines in real life. Fundamentally, it’s a structure that eliminates storage between processing points - we don’t need to worry about storing barrels of oil if it’s pumped.

Go figure: in software, the clever developers and engineers who wrote the code for piping data set it up so that it never occupies too much memory on a machine. No matter how big the logfile above is, it won’t hang the terminal, because each program in the pipeline handles small chunks of data as they flow past, rather than whole containers of them. The logfile never gets loaded into memory all at once, but rather in manageable parts.
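To make that concrete, here’s a rough Node sketch of the same idea as the shell pipeline above. It’s only an illustration: the function name grep is my own, and it assumes plain-text input where each log entry sits on its own line. It leans on Node’s built-in readline module to turn a stream of chunks into lines.

```javascript
const readline = require('readline');

// Collect every line of a readable stream that contains `needle`.
// The input is consumed chunk by chunk, so the whole file is never held in
// memory at once. (The matches are gathered in an array only to keep the
// example short.)
async function grep(input, needle) {
  const lines = readline.createInterface({ input, crlfDelay: Infinity });
  const matches = [];
  for await (const line of lines) {
    if (line.includes(needle)) matches.push(line);
  }
  return matches;
}

// Usage, assuming a server.log file exists in the working directory:
// const fs = require('fs');
// grep(fs.createReadStream('server.log'), '400').then(console.log);
```

Note that a line split across two chunks still comes out whole, because readline holds on to the unfinished remainder until the next chunk arrives.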

I don’t want to reinvent the wheel here. Now that I’ve covered a metaphor for streams and the rationale for using them, I’ll point you to Flavio Copes’ great blog post covering how they’re implemented in Node. Take as long as you need to cover the basics there, and when you’re ready, come back and we’ll go over a use case.

The Situation

So, now that you’ve got this tool in your toolbelt, picture this:

You’re on the job and your manager / legal / HR / your client / (insert stakeholder here) has approached you with a problem: they spend way too long poring over structured PDFs. Of course, normally people won’t tell you such a thing. You’ll hear, “I spend 4 hours doing data entry”, or, “I look through price tables”, or, “I fill out the right forms so we get our company branded pencils every quarter”.

Whatever it is, if their work happens to involve both (a) the reading of structured PDF documents and (b) the bulk usage of that structured information, then you can step in and say, “Hey, we might be able to automate that and free up your time to work on other things”.

So for the sake of this post, let’s come up with a dummy company. Where I come from, the term “dummy” refers to either an idiot or a baby’s pacifier, so let’s imagine up this fake company that manufactures pacifiers. While we’re at it let’s jump the shark and say they’re 3D printed, and the company operates as an ethical supplier of pacifiers to the needy who can’t afford the premium stuff themselves.

(I know how dumb it sounds, suspend your disbelief please.)

Todd, our stakeholder at this dummy company (call it DummEth), sources the printing materials that go into its products, and has to ensure that they meet three key criteria:

  • they’re food-grade plastic, to preserve babies’ health,

  • they’re cheap, for economical production, and

  • they’re sourced as closely as possible, to support the company’s marketing copy stating that their supply chain is also ethical and pollutes as little as possible.

The Project

To make it easier to follow along, I’ve set up a GitLab repo you can clone and use. Make sure your installations of Node and npm are up to date too.

Basic Architecture: Constraints

Now, what are we trying to do? Let’s assume that Todd works well in spreadsheets, like a lot of office workers. To sort the proverbial 3D printing wheat from the chaff, Todd finds it easiest to gauge materials by food grade, price per kilogram, and location. It’s time to set some project constraints.

Let’s assume that a material’s food grade is rated on a scale from zero to three, with zero meaning banned-in-California BPA-rich plastics, and three meaning commonly used non-contaminating materials, like low-density polyethylene. This is purely to simplify our code; in reality we’d have to somehow map textual descriptions of these materials (e.g. “LDPE”) to a food grade.

Price per kilogram we can assume to be a property of the material given by its manufacturer.

Location we’re going to simplify, and assume to be a simple relative distance, as the crow flies. At the opposite end of the spectrum there’s the overengineered solution: using some API (e.g. Google Maps) to work out the rough distance a given material would travel to reach Todd’s distribution center(s). Either way, let’s say we’re given it as a value (kilometres-to-Todd) in Todd’s PDFs.
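If it helps to see those three constraints as data, here’s one way they might look in code. Treat it as a sketch under my own assumptions: the field names (name, foodGrade, pricePerKg, distanceKm) and the CSV column order are hypothetical and aren’t fixed by anything above.

```javascript
// Hypothetical field names; only the three criteria themselves come from
// the constraints described above.
function validateMaterial({ name, foodGrade, pricePerKg, distanceKm }) {
  // Food grade is an integer on the zero-to-three scale.
  if (!Number.isInteger(foodGrade) || foodGrade < 0 || foodGrade > 3) {
    throw new RangeError(`foodGrade must be an integer from 0 to 3, got ${foodGrade}`);
  }
  // Price and distance must be finite, non-negative numbers.
  for (const [label, value] of [['pricePerKg', pricePerKg], ['distanceKm', distanceKm]]) {
    if (!Number.isFinite(value) || value < 0) {
      throw new RangeError(`${label} must be a non-negative number, got ${value}`);
    }
  }
  return { name: String(name), foodGrade, pricePerKg, distanceKm };
}

// Turn one material into a spreadsheet-friendly CSV line.
function toCsvLine(material) {
  const { name, foodGrade, pricePerKg, distanceKm } = validateMaterial(material);
  const quotedName = `"${name.replace(/"/g, '""')}"`; // escape embedded quotes
  return [quotedName, foodGrade, pricePerKg, distanceKm].join(',');
}

// toCsvLine({ name: 'LDPE', foodGrade: 3, pricePerKg: 1.5, distanceKm: 120 })
// → "LDPE",3,1.5,120
```

Rejecting bad values up front means a garbled PDF surfaces as an error instead of a quietly wrong row in Todd’s spreadsheet.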

Also, let’s consider the context we’re working in. Todd effectively operates as an information gatherer in a dynamic market. Products come in and out, and their details can change. This means we’ve got an arbitrary number of PDFs that can change – or more aptly, be updated – at any time.

So based on these constraints, we can finally figure out what we want our code to accomplish. If you’d like to test your design ability, pause here and consider how you’d structure your solution. It might not look the same as what I’m about to describe, and that’s fine, as long as you’re providing a sane workable solution for Todd, and something you wouldn’t tear your hair out later trying to maintain.

Basic Architecture: Solutions

So we’ve got an arbitrary number of PDFs, and some rules for how to parse them. Here’s how we can do it:

  1. Set up a Stream object that can read from some input, like an HTTP client requesting PDF downloads, or a module we’ve written that reads PDF files from a directory in the file system.

  2. Set up an intermediary Buffer. This is like the waiter in a restaurant delivering a finished dish to its intended customer; every time a full PDF gets passed into the stream, we flush those chunks into the buffer so it can be transported.

  3. The waiter (Buffer) delivers the food (PDF data) to the customer (our Parsing function), who does what they please (convert to some spreadsheet format) with it.

  4. When the customer (Parser) is done, let the waiter (Buffer) know that they’re free and can work on new orders (PDFs).

You’ll notice that there’s no clear end to this process. Like a restaurant, our Stream-Buffer-Parser combo never finishes, until of course there’s no more data – no more orders – coming in.
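For the curious, the four steps above could be roughed out like this. It’s a sketch of the shape only, not the implementation: collect and processAll are names I’ve made up, source is assumed to be any iterable of readable streams (one per PDF), and parse is a stand-in for whatever PDF parser ends up being used.

```javascript
// Step 2: the waiter. Drain one PDF's stream into a single Buffer.
async function collect(pdfStream) {
  const chunks = [];
  for await (const chunk of pdfStream) chunks.push(Buffer.from(chunk));
  return Buffer.concat(chunks);
}

// Steps 1, 3 and 4: take PDFs from `source` one at a time, hand each full
// Buffer to `parse` (the customer), and only pick up the next order once
// `parse` has finished with the current one.
async function processAll(source, parse) {
  const rows = [];
  for await (const pdfStream of source) {
    const pdf = await collect(pdfStream);
    rows.push(await parse(pdf));
  }
  return rows;
}
```

Because the loop awaits the parser before pulling the next stream, only one whole PDF sits in memory at a time, however many are queued up.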

To Be Continued…

Now, I know there’s no real implementation just yet. That’s deliberate, and crucial. It’s important to be able to reason about our systems prior to writing them. We won’t get everything right the first time, even with a priori reasoning. Things always break in the wild. Bugs need to be fixed.

That said, it’s a powerful exercise in restraint and foresight to plan out your code prior to writing it. If you can simplify systems of increasing complexity into manageable parts and analogies, you’ll become far more productive, as the cognitive stress from those complexities fades into well-designed abstractions.

I’ll follow up with an implementation of the architecture above in a later post. If you like this style of posting – conceptualising software – please let me know with an email or get in touch on Reddit. The feedback is great and lets me tailor my writing to things that people actually find useful.


[1] Wikipedia: Streamlines
[2] Wikipedia: Pipelines
[3] Flavio Copes: NodeJS Streams
[4] GitLab: example repo for this project
[5] Node API: Buffer
[6] Email me!
[7] Get in touch on Reddit

2019-02-15 00:00 +0000