How to parse structured PDFs with NodeJS

A fortnight ago I wrote a blog post detailing the architecture for our Stream-based PDF parser. Today, we’re concerned with the parsing module that will process the PDFs themselves. So in the grand scheme of things, it looks something like this:

Overall Program Architecture

Introducing Dependencies

Now as a disclaimer, I should add that there is a whole world of thought around introducing dependencies into your code. I’d love to cover this concept in another post, but in the meantime let me just say that one of the fundamental conflicts at play is the one between our desire to get our work done quickly (i.e.: to avoid NIH syndrome), and our desire to avoid third-party risk.

Applying this to our project, I opted to offload the bulk of our PDF processing to the pdfreader module. Here are a few reasons why:

  • It was published recently, which is a good sign that the repo is up-to-date.

  • It has one dependency - that is, it’s just an abstraction over another module - which is regularly maintained on GitHub. This alone is a great sign. Moreover, the dependency, a module called pdf2json, has hundreds of stars, 22 contributors, and plenty of eyeballs keeping a close eye on it.

  • The maintainer, Adrien Joly, does good bookkeeping in GitHub’s issue tracker and actively tends to users’ and developers’ questions.

  • Auditing it via npm (6.4.1) finds no vulnerabilities.

So all in all, it seems like a safe dependency to include.

Now, the module works in a fairly straightforward way, although its README doesn’t explicitly describe the structure of its output. The cliff notes:

  1. It exposes the PdfReader class to be instantiated

  2. This instance has two methods for parsing a PDF. They produce the same output and differ only in their input: PdfReader.parseFileItems takes a filename, and PdfReader.parseBuffer takes data that we don’t want to reference from the filesystem.

  3. The methods ask for a callback, which gets called each time the PdfReader finds what it denotes as a PDF item. There are three kinds: first, the file metadata, which is always the first item; page metadata, which basically acts as a carriage return for the coordinates of text items to be processed; and text items, which we can think of as simple objects / structs with a text property, and floating-point 2D AABB coordinates on the page.

  4. It’s up to our callback to process these items into a data structure of our choice, and also to handle any errors passed to it.

Here’s a code snippet as an example:

const { PdfReader } = require('pdfreader');

// Initialise the reader
const reader = new PdfReader();

// Read some arbitrarily defined buffer
// (e.g.: from fs.readFile - anything that gives you the PDF's bytes)
reader.parseBuffer(buffer, (err, item) => {

  if (err)
    console.error(err);

  else if (!item)
    /* pdfreader queues up the items in the PDF and passes them to
     * the callback. When no item is passed, it's indicating that
     * we're done reading the PDF. */
    console.log('Done.');

  else if (item.file)
    // File items only reference the PDF's file path.
    console.log(`Parsing ${item.file && item.file.path || 'a buffer'}`);

  else if (item.page)
    // Page items simply contain their page number.
    console.log(`Reached page ${item.page}`);

  else if (item.text) {

    // Text items have a few more properties:
    const itemAsString = [
      item.text,
      'x: ' + item.x,
      'y: ' + item.y,
      'w: ' + item.width,
      'h: ' + item.height,
    ].join('\n\t');

    console.log('Text Item: ', itemAsString);

  }

});
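
For completeness: if you’d rather point the reader straight at a file on disk instead of handing it a buffer, parseFileItems works the same way. Here’s a minimal sketch (the file path is just a placeholder, not a file from this project):

const { PdfReader } = require('pdfreader');

// Same callback shape as above - only the input differs.
// './docs/sample.pdf' is a placeholder path.
new PdfReader().parseFileItems('./docs/sample.pdf', (err, item) => {
  if (err)
    console.error(err);
  else if (!item)
    console.log('Done.');
  else if (item.text)
    console.log(item.text);
});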

Todd’s PDFs

Let’s return to the Todd situation, just to provide some context. We want to store data about the pacifiers based on three key criteria:

  • their food-grade, to preserve babies’ health,

  • their cost, for economical production, and

  • their distance to Todd, to support the company’s marketing copy stating that their supply chain is also ethical and pollutes as little as possible.

I’ve hardcoded a simple script that randomises some dummy products, and you can find it in the /data directory of the companion repo for this project. That script writes that randomised data to JSON files.

There’s also a template document in there. If you’re familiar with templating engines like Handlebars, then you’ll understand this. There are online services - or if you’re feeling adventurous, you can roll your own - that take JSON data and fill in the template, and give it back to you as a PDF. Maybe for completeness’ sake we can try that out in another project. Anyway: I’ve used such a service to generate the dummy PDFs we’ll be parsing.

Here’s what one looks like (extra whitespace has been cropped out):

Dummy PDF

We’d like to yield from this PDF some JSON that gives us:

  • the requisition ID and date, for bookkeeping purposes,
  • the SKU of the pacifier, for unique identification, and
  • the pacifier’s properties (name, food grade, unit price, and distance), so Todd can actually use them in his work.

How do we do this?

Reading the Data

First let’s set up the function for reading data out of one of these PDFs and extracting pdfreader’s PDF items into a usable data structure. For now, let’s have an array representing the document: each item in the array is an object representing the page at that index; each property of that page object has a y-value for its key, and an array of the text items found at that y-value for its value. Here’s a diagram to make it simpler to understand:

Our Raw PDF Data Structure
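
In plain JavaScript terms, the structure we’re aiming for looks roughly like this (the values are lifted from the sample output further below, trimmed for brevity):

// One object per page; each key is a Y position, and each value is
// the array of text items found at that Y position, left to right.
const pages = [
  {
    '3.473': [ 'PRODUCT DETAILS REQUISITION' ],
    '6.898': [ 'Pacifier Tech', 'Todd Lerr' ],
    '12.235': [ 'SKU', '6308005' ],
    // ...and so on for the rest of the rows on the page
  },
  // ...one more object here for each additional page
];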

The readPDFPages function in /parser/index.js handles this, similarly to the example code written above:

/* Accepts a buffer (e.g.: from fs.readFile), and parses
 * it as a PDF, giving back a usable data structure for
 * application-specific, second-level parsing.
 */
function readPDFPages (buffer) {
  const reader = new PdfReader();

  // We're returning a Promise here, as the PDF reading
  // operation is asynchronous.
  return new Promise((resolve, reject) => {

    // Each item in this array represents a page in the PDF
    let pages = [];

    reader.parseBuffer(buffer, (err, item) => {

      if (err)
        // If we've got a problem, eject!
        reject(err)

      else if (!item)
        // If we're out of items, resolve with the data structure
        resolve(pages);

      else if (item.page)
        // If the parser's reached a new page, it's time to
        // work on the next page object in our pages array.
        pages.push({});

      else if (item.text) {

        // If we have NOT got a new page item, then we need
        // to either retrieve or create a new "row" array
        // to represent the collection of text items at our
        // current Y position, which will be this item's Y
        // position.

        // Hence, this line reads as,
        // "Either retrieve the row array for our current page,
        //  at our current Y position, or make a new one"
        const row = pages[pages.length-1][item.y] || [];

        // Add the item to the reference container (i.e.: the row)
        row.push(item.text);

        // Include the container in the current page
        pages[pages.length-1][item.y] = row;

      }

    });
  });

}
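
As a quick sketch of how you might call it (the file path is hypothetical; any way of getting the PDF into a Buffer will do):

const fs = require('fs');

fs.readFile('./data/requisition.pdf', (err, buffer) => {
  if (err) return console.error(err);

  readPDFPages(buffer)
    .then((pages) => console.log(JSON.stringify(pages, null, 2)))
    .catch(console.error);
});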

So now, passing a PDF buffer into that function, we get back some organised data. Here’s what I got from a test run, printed as JSON:

[ { '3.473': [ 'PRODUCT DETAILS REQUISITION' ],
    '4.329': [ 'Date: 23/05/2019' ],
    '5.185': [ 'Requsition ID: 298831' ],
    '6.898': [ 'Pacifier Tech', 'Todd Lerr' ],
    '7.754': [ '123 Example Blvd', 'DummEth Pty. Ltd.' ],
    '8.61': [ 'Timbuktu', '1337 Leet St' ],
    '12.235': [ 'SKU', '6308005' ],
    '13.466': [ 'Product Name', 'Square Lemon Qartz Pacifier' ],
    '14.698': [ 'Food Grade', '3' ],
    '15.928999999999998': [ '$ / kg', '1.29' ],
    '17.16': [ 'Location', '55' ] } ]

If you look carefully you’ll notice that there’s a spelling error in the original PDF: “Requisition” is misspelled as “Requsition”. The beauty of our parser is that we don’t particularly care about errors like these in our input documents. As long as they’re structured correctly, we can extract data from them accurately.

Now we just need to organise this into something a bit more usable (as if we’d expose it via API). The structure we’re looking for is something along the lines of this:

{
  reqID: '000000',
  date: 'DD/MM/YYYY', // Or something else based on geography
  sku: '000000',
  name: 'Some String We Have Trimmed',
  foodGrade: 'X',
  unitPrice: 'D.CC',  // D for Dollars, C for Cents
  location: 'XX',
}

An Aside: Data Integrity

Why are we including the numbers as strings? It comes down to the risk of parsing. Let’s say we coerced all of our numeric fields to Numbers:

  • The unit price and location would be fine - they’re meant to be treated as numbers after all.

  • The food grade, for this very limited project, is technically safe - no data gets lost when we coerce it - but since it’s effectively a classifier, like an enum, it’s better off kept as a string.

  • The requisition ID and SKU, however, could lose important data if coerced to Numbers. If the ID for a given requisition starts with three zeros and we coerce it to a Number, well, we’ve just lost those zeros and garbled the data.

So because we want data integrity when reading the PDFs, we just leave everything as a String. If the application code wants to convert some fields to numbers to make them usable for arithmetic or statistical operations, then we’ll let the coercion occur at that layer. Here we just want something that parses PDFs consistently and accurately.
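
To make the leading-zero point concrete, here’s what the round trip looks like in Node:

// Coercing an ID with leading zeros to a Number destroys information:
Number('000123');          // 123
String(Number('000123'));  // '123' - the original zeros are gone for good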

Restructuring the Data

So now we’ve got Todd’s information, we just need to organise it in a usable way. We can use a variety of array and object manipulation functions, and here MDN is your friend.

This is the step where everyone has their own preferences, some preferring the method that just gets the job done and minimises dev time, and others preferring to scout for the best algorithm for the job (e.g.: cutting down iteration time). It’s a good exercise to see if you can come up with a way to do this and compare it to what I got. I’d love to see better, simpler, faster, or even just different ways to accomplish the same goal.

Anyway, here’s how I did it: the parseToddPDF function in /parser/index.js.

function parseToddPDF (pages) {

  const page = pages[0]; // We know there's only going to be one page

  // Declarative map of PDF data that we expect, based on Todd's structure
  const fields = {
    // "We expect the reqID field to be on the row at 5.185, and the
    //  first item in that array"
    reqID: { row: '5.185', index: 0 },
    date: { row: '4.329', index: 0 },
    sku: { row: '12.235', index: 1 },
    name: { row: '13.466', index: 1 },
    foodGrade: { row: '14.698', index: 1 },
    unitPrice: { row: '15.928999999999998', index: 1 },
    location: { row: '17.16', index: 1 },
  };

  const data = {};

  // Assign the page data to an object we can return, as per
  // our fields specification
  Object.keys(fields)
    .forEach((key) => {

      const field = fields[key];
      const val = page[field.row][field.index];

      // We don't want to lose leading zeros here, and can trust
      // any application / data handling to worry about that. This is
      // why we don't coerce to Number.
      data[key] = val;

    });

  // Manually fixing up some text fields so they're usable.
  // Note: 'Requsition' (sic) matches the misspelled label in the source PDF.
  data.reqID = data.reqID.slice('Requsition ID: '.length);
  data.date = data.date.slice('Date: '.length);

  return data;

}

The meat and potatoes here is in the forEach loop, and how we’re using it. Since we’ve already retrieved the Y position of each text item, it’s simple to specify each field we want as a position in our pages object, effectively giving ourselves a map to follow.

All we have to do then is declare a data object to output, iterate over each field we specified, follow the route as per our spec, and assign the value we find at the end to our data object.

After a few one-liners to tidy up some string fields, we can return the data object and we’re off to the races. Here’s what it looks like:

{ reqID: '298831',
  date: '23/05/2019',
  sku: '6308005',
  name: 'Square Lemon Qartz Pacifier',
  foodGrade: '3',
  unitPrice: '1.29',
  location: '55' }

Putting it all together

Architecture Recap
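
To make the recap concrete, here’s a rough sketch of how the two functions compose (parsePDF is a name I’m using for illustration, not necessarily what the companion repo exports):

// First pass: raw pdfreader items -> array of Y-keyed page objects
// Second pass: Y-keyed pages -> the flat object Todd needs
function parsePDF (buffer) {
  return readPDFPages(buffer)
    .then(parseToddPDF);
}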

In a fortnight we’ll move on to supercharging this parsing module with our stream-buffer architecture, as per the first post. I really hope you enjoyed this read. If you’ve got any concerns or questions – if you spotted an error – let me know by email, or get in touch on Reddit. I’d love to hear what you think about this post and what I can do to help both it, and you, out.

References

[1] The previous blog post in this series
[2] Wikipedia: NIH Syndrome
[3] A PDF intro to third party risk
[4] NPM: pdfreader
[5] Adrien Joly’s GitHub page
[6] StackOverflow: “What is AABB - Collision detection?”
[7] This project’s companion repo (part-2 branch): /data folder
[8] The Handlebars project
[9] This project’s companion repo (part-2 branch): /parser/index.js
[10] MDN: JavaScript Reference
[11] Email me!
[12] Get in touch on Reddit

