Accessible Document Converter Solution

VerseOne's Accessible Document Converter (ADC) is designed to make it easier to convert PDF and Word documents to accessible HTML web pages. ADC is not designed to reproduce heavily designed documents faithfully, as they would appear in a graphic design application or even a PDF viewer, which is why the module includes a link to download the original document.

Although it is continually being improved, HTML is fundamentally more limited than most word processors. For example, HTML has no concept of "tiered numbering" in lists: we can use CSS to make a list look as though it has tiered numbers, but this will not help those using screen readers.

How the converter works

Our conversions are provided by two services: PDF to Word uses Adobe’s professional PDF Services API, and Word to HTML is handled by a library called Pandoc. So, a PDF goes through Adobe first and then through Pandoc; a Word document only goes through Pandoc.
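The routing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ADC's actual code: the function names `plan_conversion` and `pandoc_command` and the step labels are hypothetical, Word documents are assumed to be `.docx` files, and the Adobe step is represented only by a label because its API calls are not described here. The command line uses Pandoc's standard `-f`, `-t` and `-o` options.

```python
from pathlib import Path


def plan_conversion(filename: str) -> list[str]:
    """Return the ordered services a file passes through (labels are illustrative)."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        # PDFs are first turned into Word documents, then into HTML.
        return ["adobe-pdf-to-word", "pandoc-word-to-html"]
    if suffix == ".docx":
        # Word documents skip the Adobe step entirely.
        return ["pandoc-word-to-html"]
    raise ValueError(f"Unsupported file type: {suffix or filename}")


def pandoc_command(docx_path: str, html_path: str) -> list[str]:
    """Build a Pandoc command line that converts a Word file to HTML."""
    return ["pandoc", "-f", "docx", "-t", "html", "-o", html_path, docx_path]
```

The list returned by `pandoc_command` could be passed to `subprocess.run` on a machine where Pandoc is installed; building the command separately keeps the routing logic easy to check on its own.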

Once we receive the HTML, we can apply a number of transforms ourselves, to make up for some translation errors where possible and to ensure that we don’t have multiple identical images, etc. This allows us to finesse some elements of the HTML, provided we have some way of determining the original data.
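As an illustration of the "multiple identical images" transform, the sketch below groups images by a hash of their bytes. It is one possible approach, not a description of how the ADC does it: the function name `deduplicate_images` and the input shape (a mapping of image name to raw bytes) are assumptions made for the example.

```python
import hashlib


def deduplicate_images(images: dict[str, bytes]) -> dict[str, str]:
    """Map every image name to the first name seen with identical bytes."""
    first_seen: dict[str, str] = {}  # content hash -> canonical image name
    canonical: dict[str, str] = {}   # original image name -> canonical image name
    for name, data in images.items():
        digest = hashlib.sha256(data).hexdigest()
        # setdefault stores the name only the first time this content is seen.
        canonical[name] = first_seen.setdefault(digest, name)
    return canonical
```

With a mapping like this, each image reference in the HTML could be rewritten to its canonical name so that only one copy of each image needs to be stored.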

Although we are constantly trying to improve the module, some elements are beyond our control. The following outlines known issues with the ADC, short-term workarounds, and any development progress that we have made or are researching.


Graphs

Graphs from Word / PDF documents do not render properly: this is unlikely to be fixed. This section describes a workaround.

Converting your documents

We ran a customer-supplied PDF through the Adobe PDF to Word conversion. When opened in Word, the resulting document had picked up images for some graphs, but others had their elements scattered all over the page. Word was also unable to interpret the images for editing, so the conversion was clearly imperfect.

Looking at the code of one of the supplied Word documents, there is no easy way to determine what is a graph and what is not. We would hope to see some kind of tag that identifies a graph, but there is none: the graph is a collection of elements that don’t really mean much in HTML.

We will spend some time looking at whether we can derive more information, or even generate the graphs as images in code; however, we view the likelihood of success as low.

Adding graphs as images

Graphs should be saved into the document as images (which, with alt text, will be more meaningful to those using screen readers anyway).
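Because the alt text is what makes a graph image meaningful to screen-reader users, it can be worth checking the converted HTML for images that lack it. The following is a minimal sketch using Python's standard `html.parser` module; the names `MissingAltFinder` and `images_missing_alt` are hypothetical and this is not part of the ADC.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> that has no non-empty alt attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.missing: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # An absent, valueless or blank alt attribute counts as missing here.
        # (An empty alt suits decorative images, but not a graph.)
        if not (attributes.get("alt") or "").strip():
            self.missing.append(attributes.get("src") or "")


def images_missing_alt(html: str) -> list[str]:
    """Return the src values of images without usable alt text."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing
```

Any image this reports would need a description added in the Word document, or in the editor, before the page is published.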

Images cannot simply be pasted into the converter editor: pasting does not actually upload the images to our servers (or make them “public”), so the editor displays a missing-image icon.

VerseOne does have the ability to upload images to our servers through the Media Manager in our system (it's the same method as for uploading documents). However, we would then need to develop an interface to re-link those images back into the document, and it may not be immediately obvious from the names, etc., which images need to be linked back in, and where: both PDF and Word tend to rename images to a unique string of letters and numbers.

Short-term fix

Given all of the above, the best and most immediate way to deal with this is:

  1. to be supplied with the Word documents;

  2. to create images from the graphs;

  3. to replace the graphs in the Word document with the images of the graphs;

  4. to upload the Word document and convert it as usual.

This should ensure decent fidelity and proper rendering of the document.