Difficulty Exporting PDF to MS Word

jwarthman · January 18, 2015, 12:30am

I’m using DTPO version 2.8.2 on OS X 10.10.1.

I have a large number of .png files (screen shots from an iPad) that I want to make “editable”. my process was as follows:

1. Import .png files into DTPO
2. Convert all .png files to Editable PDF files
3. Merge all the PDF files into a single file
4. File > Export > As Word document

The PDF files, and the merged PDF, all look great! They have a mix of text and images, and the text is selectable, as you would expect. Even when I open the merged PDF file in Preview, it looks great.

The problem I’m having is that, when I open the MS Word version after export, the text is mostly ok, but instead of the photos I get odd, random text like this:

{
VsAfc^^L^Sr^L^^. '^fL :-^L mA^J^2
V
X1
y Jl;W
p* ai
i
/ ,-
’ \

I have tried various other ways of getting the info into MS Word, including export in other formats (RTF, OPML), and even copy/paste, all with similar results.

So the question is: how can I get all the content of my .PDF from DTPO into MS Word?

Thanks!

Jim

korm · January 18, 2015, 4:25am

Converting images (.png) into PDF uses DEVONthink’s OCR process. OCR attempts to recognize text in images, but is not necessarily 100% reliable. Even when all the text in your screen shots is properly recognized, there can be other artifacts on an iPad screen that are not text but that the OCR process interprets as text. A further complicating factor is converting PDF to Word. At this stage you have Word re-converting the previous text, and that can itself introduce errors.

I did a little test just now: I took an iPad screen shot of this forum page with your question, transferred it to DEVONthink on the desktop, OCRd it, and then selected that PDF+Text file and had DEVONthink convert it to plain text. The latter is the best method DEVONthink has of showing you the text layer of a PDF, which is what OCR produces. I’d estimate that 20% of the text on that page is gibberish. For example, this portion of the screen print

is OCRd as

This isn’t the result of error or bugs – it’s just that portions of the screen that you think are human-readable are not machine-readable.

Your process looks reasonable, but it cannot be perfect. You’ll always need to edit the text. If you are screen-printing web pages, you’ll be better served by using software on your iPad that extracts text from HTML. (GoodReader can do that – there are plenty of others.)
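Since the text will need editing anyway, one way to triage it is to flag the lines of the plain-text conversion that look like OCR noise and get a rough gibberish percentage. The following is only a minimal sketch, not anything DEVONthink provides: the function names, the short-word list, and the 50% threshold are all assumptions chosen for illustration, and the heuristic will misjudge some lines.

```python
import string

# Hypothetical helper, not part of DEVONthink: a crude guess at whether a
# line of OCR output is noise. Short-word list and threshold are arbitrary.
SHORT_WORDS = {"a", "i", "an", "as", "at", "be", "by", "do", "in", "is",
               "it", "my", "of", "on", "or", "to", "we"}


def is_wordlike(token: str) -> bool:
    """True when a token looks like an ordinary word."""
    # Drop surrounding punctuation and any apostrophes inside the token.
    core = token.strip(string.punctuation + "‘’“”")
    core = core.replace("'", "").replace("’", "")
    if not core.isalpha():
        return False
    # Reject odd mixed case such as "fL" or "VsAfc".
    if not (core.islower() or core.istitle() or core.isupper()):
        return False
    return len(core) >= 3 or core.lower() in SHORT_WORDS


def looks_like_noise(line: str) -> bool:
    """True when fewer than half of the tokens on the line look like words."""
    tokens = line.split()
    if not tokens:
        return False  # blank lines are not counted as noise
    wordlike = sum(is_wordlike(t) for t in tokens)
    return wordlike / len(tokens) < 0.5


def gibberish_ratio(text: str) -> float:
    """Fraction of non-blank lines flagged as noise."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(looks_like_noise(ln) for ln in lines) / len(lines)
```

Run over the plain-text version of an OCRd PDF, `gibberish_ratio` gives a rough figure to compare between documents, and `looks_like_noise` can be used to list the lines worth checking by hand first.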

jwarthman · January 18, 2015, 5:47am

Korm,
Thanks for responding. I agree totally with what you say. In fact, I do quite a lot of OCR with scanned images from my Fujitsu ScanSnap iX500, and I’m familiar with the kinds of errors that can result.

But I don’t think OCR issues are the problem here.

This has more to do with DTPO constructing PDF files containing images and text (with a text layer) which, for some reason, cannot be successfully converted to .DOCX or .RTF.

The issue is that none of the images make it across, and in their place are these random characters.

I do not believe this has anything to do with the OCR.

Thanks,

Jim

korm · January 18, 2015, 1:10pm

It’s “Korm”, please.

OCR (DEVONthink’s or any other product’s) creates PDFs with both image and text layers.

Putting DEVONthink aside entirely – have you ever been able to open in Word an OCRd PDF made with any product and see both text and images? (I assume you are expecting the text and images to appear separately in the document). Just trying to understand the source of the expectation for how this should work. Over here, Acrobat Pro XI and PDFPen Pro 7 both make garbled messes when they OCR iPad PNG screen prints and convert them to .docx, so it would be helpful to know the baseline – the product that actually works as expected.

jwarthman · January 18, 2015, 7:04pm

Sorry - fixed.

Good question! I have never needed to do this in the past, so today I did an experiment.

I printed the PDF that I constructed using the .PNG files (see my first post in this thread). The hard copy looked just like the original .PNG files, which is to say the text and pictures were interspersed.

I then set up ABBYY FineReader for ScanSnap to use the “ABBYY Scan to Word” setting, and I scanned the printed document.

The resulting Word document was not perfect. At least one picture was totally absent, and each element (picture or text block) appeared in its own “frame” in Word. But the results were much better than when I exported the PDF in Word format from DTPO. Much better! I can’t help but wonder how the “full version” of ABBYY FineReader Pro for Mac or ABBYY FineReader 12 Professional for Windows would perform on this task, but unless I can get a sale price, I won’t bother trying to find out.

Thanks for asking, Korm. This experiment has confirmed that scanning, then OCRing, then exporting to MS Word is less than perfect! But it has also shown that we can get much better results using tools other than DTPO for this task.

To be clear, I love, love, love DTPO! This is the first time I’ve needed to export a PDF to Word, and the results were disappointing, to say the least. But in other ways, DTPO remains my go-to tool!