Nat Taylor: Blog, AI, Product Management & Tinkering

Making PDF Newsletters Accessible


As the Rhodes 19 Fleet 5 webmaster, I had to decide what to do with over a decade of PDF newsletters. I decided the best thing to do was to extract and structure the content, then put it into WordPress so that it was easy to find, access, archive and index.

The newsletters were produced in Word, then saved as PDF, and used multiple columns, which made it basically impossible to do this in a completely automated way. But there had to be something better than adding the content post-by-post into WordPress.

Complete Automation? Nope.

All of the content was stored in the PDFs as text, so I figured it must be possible to extract it. So, I used PDFSam to merge them into a mega-PDF, then tried producing HTML with c:\bin\pdftohtml.exe -i -noframes -nomerge all.pdf and voilà, I had a giant HTML file. Unfortunately, this is where I abandoned this technique. It didn’t pick up headers and instead just made them <b>Heading</b>, which wouldn’t do. Not only that, but it left everything riddled with &nbsp; entities.
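For anyone who wants to repeat the experiment, the conversion step can be scripted. This is a minimal sketch in Python, assuming pdftohtml is on the PATH rather than at c:\bin. The helper names are hypothetical, and the flags are exactly the ones from the command above.

```python
import subprocess


def build_pdftohtml_command(pdf_path, exe="pdftohtml"):
    """Return the argument list for the conversion described above.

    The function name and the default `exe` value are assumptions made
    for this sketch; the flags are the ones from the original command.
    """
    return [
        exe,
        "-i",         # ignore images
        "-noframes",  # write a single HTML file instead of a frameset
        "-nomerge",   # do not merge paragraphs
        pdf_path,
    ]


def convert(pdf_path, exe="pdftohtml"):
    # check=True raises CalledProcessError if the converter exits non-zero
    subprocess.run(build_pdftohtml_command(pdf_path, exe), check=True)
```

Calling convert("all.pdf") runs the same command as above against the merged PDF.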

Semi-automation? Yep.

I wondered, “If I structure the content, would it be easy to get into WordPress?” Some quick Googling turned up WP All Import, which could take XML and create posts. Sweet.

So, if I could get the content into XML, then it would be easy to import that into WordPress. And that is just what I did, using the following format.

<?xml version="1.0" encoding="iso-8859-1"?>
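Only the XML declaration of the format survives here, so the following is just a guess at the general shape, not the actual file. It is a minimal sketch in Python that writes one element per article. The element names (newsletter, post, title, date, content) and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_newsletter_xml(posts):
    """Serialize a list of post dicts to XML bytes in iso-8859-1.

    Element names are placeholders chosen for this sketch; they are
    not taken from the original file.
    """
    root = ET.Element("newsletter")
    for post in posts:
        node = ET.SubElement(root, "post")
        for field in ("title", "date", "content"):
            ET.SubElement(node, field).text = post[field]
    # xml_declaration=True (Python 3.8+) prepends the <?xml ...?> header;
    # characters outside iso-8859-1 are written as numeric references.
    return ET.tostring(root, encoding="iso-8859-1", xml_declaration=True)


if __name__ == "__main__":
    sample = [{"title": "Spring Series", "date": "2005-05-01",
               "content": "Race results go here."}]
    with open("newsletters.xml", "wb") as fh:
        fh.write(build_newsletter_xml(sample))
```

Keeping each article as a flat element with a few child fields keeps the file easy to hand-edit, which matters when the content is being pasted in from PDFs.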

Conclusions

Taking print-formatted content (e.g. with columns) and turning it into web-formatted content is about as hard as it sounds. Relying on software to extract the content without human intervention is unreliable at best. Even if there were software that could use a user-defined template to know where to look for stuff, I’d still be at the mercy of 10 years of class presidents using slightly different formats. So, my final solution seems like a reasonable balance of simplicity and automation.

  1. Install the WP All Import plugin.
  2. Structure the PDF content as XML.
  3. Import the XML with the plugin.

> Note: At first I wrestled with UTF-8 for a bit before landing on iso-8859-1.
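The note doesn't say what went wrong with UTF-8. One common hazard with Word-produced text, whichever encoding is chosen, is that it tends to contain curly quotes and dashes that have no iso-8859-1 equivalent. A minimal sketch in Python, assuming the XML is written in iso-8859-1: characters outside the charset can be emitted as numeric character references, which XML parsers decode back to the original characters. The function name is hypothetical.

```python
def to_latin1_xml_text(text):
    """Encode text as iso-8859-1 bytes for use inside an XML document.

    Hypothetical helper for this sketch. It handles encoding only;
    markup characters such as & and < still need escaping separately.
    """
    # "xmlcharrefreplace" swaps unencodable characters for &#NNNN; references
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


if __name__ == "__main__":
    # \u2019 is the right single quotation mark that Word inserts
    print(to_latin1_xml_text("It\u2019s race day"))  # b'It&#8217;s race day'
```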
