Converting images (.png) into PDF uses DEVONthink’s OCR process. OCR attempts to recognize text in images, but it is not 100% reliable. Even when all the text in your screenshots is recognized correctly, an iPad screen can contain other artifacts that are not text but that the OCR process interprets as text. A further complicating factor is converting PDF to Word. At that stage Word re-converts the previously recognized text, and that can itself introduce errors.
I did a little test just now: I took an iPad screenshot of this forum page with your question, transferred it to DEVONthink on the desktop, OCR’d it, and then selected that PDF+Text file and had DEVONthink convert it to plain text. The latter is the best method DEVONthink has of showing you the text layer of a PDF, which is what OCR produces. I’d estimate that 20% of the text on that page is gibberish. For example, this portion of the screen print
This isn’t the result of errors or bugs. Portions of the screen that you think are human-readable are simply not machine-readable.
Your process looks reasonable, but it cannot be perfect. You’ll always need to edit the text. If you are screen-printing web pages, you’ll be better served by using software on your iPad that extracts text from HTML. (GoodReader can do that – there are plenty of others.)
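To illustrate why extracting text from HTML avoids the OCR step entirely, here is a minimal Python sketch that pulls the visible text straight out of a page’s markup. It uses only the standard library; the names VisibleTextExtractor and html_to_text are made up for this example and have nothing to do with how GoodReader or DEVONthink work internally.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        # One text node per line; a rough approximation of the page layout.
        return "\n".join(self._chunks)


def html_to_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because the text comes from the markup rather than from pixels, there is nothing for an OCR engine to misread. The trade-off is that layout and images are lost, and each text node lands on its own line.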
Thanks for responding. I agree totally with what you say. In fact, I do quite a lot of OCR with scanned images from my Fujitsu ScanSnap iX500, and I’m familiar with the kinds of errors that can result.
But I don’t think OCR issues are the problem here.
This has more to do with DTPO constructing PDF files containing images and text (with a text layer) which, for some reason, cannot be successfully converted to .DOCX or .RTF.
The issue is that none of the images make it across, and in their place are these random characters.
I do not believe this has anything to do with the OCR.
OCR (DEVONthink’s or any other product’s) creates PDFs with both image and text layers.
Putting DEVONthink aside entirely: have you ever been able to open in Word an OCR’d PDF made with any product and see both text and images? (I assume you are expecting the text and images to appear separately in the document.) I’m just trying to understand the source of the expectation for how this should work. Over here, Acrobat Pro XI and PDFPen Pro 7 both make garbled messes when they OCR iPad PNG screen prints and convert them to .docx, so it would be helpful to know the baseline, that is, the product that actually works as expected.
Good question! I have never needed to do this in the past, so today I did an experiment.
I printed the PDF that I constructed using the .PNG files (see my first post in this thread). The hard copy looked just like the original .PNG files, which is to say the text and pictures were interspersed.
I then set up ABBYY FineReader for ScanSnap to use the “ABBYY Scan to Word” setting, and I scanned the printed document.
The resulting Word document was not perfect. At least one picture was totally absent, and each element (picture or text block) appeared in its own “frame” in Word. But the results were much better than when I exported the PDF in Word format from DTPO. Much better! I can’t help but wonder how the “full version” of ABBYY FineReader Pro for Mac or ABBYY FineReader 12 Professional for Windows would perform on this task, but unless I can get a sale price, I won’t bother trying to find out.
Thanks for asking, Korm. This experiment has confirmed that scanning, then OCRing, then exporting to MS Word is less than perfect! But it has also shown that we can get much better results using tools other than DTPO for this task.
To be clear, I love, love, love DTPO! This is the first time I’ve needed to export a PDF to Word, and the results were disappointing, to say the least. But in other ways, DTPO remains my go-to tool!