DTPO & ABBYY Engine 8.x

DTPO currently uses the ABBYY FineReader Engine 8.x according to the “About” menu.
According to a phone call to ABBYY, there is a newer engine out. Do you plan to upgrade the engine used by DTPO?
Why do I ask? Well, when I started using DTPO a scanned page was around 100 KB. Now, one page is around 1 MB. I was also hoping that the already very good OCR might improve a bit with a later version of ABBYY. Version 8.x seems to have problems with numbers in particular.
kind regards,
Chris

Yes, a future release will definitely upgrade the engine.

Will that upgrade include multithreading? Or is the licensing for that still infeasible?

(crossing my fingers)

Comment: I’m evaluating the Xcanex portable book and document scanner, which currently runs only under Windows. The scanner itself is a little marvel, with software that does a good job of correcting page curvature and perspective distortion when scanning books, especially two pages at a time.

To play with filetype output variations I bought ABBYY FineReader 12.0.4 for Mac, which does allow use of multiple cores with my 4-core i7 CPU. Yes, OCR processing is much faster than under the ABBYY OCR module used currently in DEVONthink Pro Office.

But there’s a downside to loading up the CPU. OCRing hundreds of pages warmed up my MacBook Pro Retina almost as much as my old TiBook (remember when people used to complain about getting burns when laptops were used on laps?), and the computer became less responsive for other uses while it was pounding away at text recognition. By contrast, current OCR in DEVONthink Pro Office is optimized for “background” OCR processing, requiring more disk space but less use of memory and CPU during OCR than the latest version of ABBYY for Mac.

Good catch. I know some multicore software will either let you specify how many cores to use, or will default to leaving a few free so that the machine is still responsive.
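
For anyone scripting their own batch OCR, here is a minimal sketch of the "leave a few cores free" idea. It is not how ABBYY or DEVONthink handle it; the function name `worker_count` and the default of two reserved cores are made up for illustration.

```python
import os


def worker_count(reserve=2, total=None):
    """Return how many workers to start, leaving `reserve` cores idle.

    `total` defaults to the number of logical cores reported by the OS.
    At least one worker is always returned so the job can still run.
    """
    if total is None:
        total = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, total - reserve)


if __name__ == "__main__":
    print("workers to start:", worker_count())
```

The result could be passed as `max_workers` to something like `concurrent.futures.ProcessPoolExecutor`, so the remaining cores stay available for interactive work.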

FWIW, the machine my ScanSnap is attached to is a 27" iMac, which has really efficient cooling, so I’d be perfectly happy letting it go wild when I scan several hundred pages of stuff.

Yes, multiple cores will be supported too.

HOORAY!!!

Will the upgrade include Chinese language support?

I too have got a question regarding the ABBYY Engine in DTPO:

Is the Engine, whether version 8 or 9, a stripped-down version of ABBYY FineReader Pro with regard to the core OCR features?

The reason I am asking is that I sometimes have problems with line endings in PDFs converted into RTF: they get a CRLF at the end of every line, so they look like poems.

Nothing against poetry, but in a multi-page prose text this becomes very annoying, especially when there are no empty lines between the real paragraphs. Empty lines would have allowed me to remove every single newline and keep only the double ones with search and replace.

Also annoying: Hyphens from former line-endings that do not get removed.
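
As a rough sketch of the search-and-replace cleanup described above, assuming the text has been exported to plain text first: the function name `unwrap` is hypothetical, and this is not what DTPO or ABBYY do internally. It joins words hyphenated across a line break, then turns single newlines into spaces while keeping blank lines as paragraph breaks.

```python
import re


def unwrap(text):
    """Undo hard line wraps in plain text exported from OCR output."""
    # Normalize CRLF and bare CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at a line end: "sen-\ntence" -> "sentence".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace a single newline with a space; runs of two or more are kept.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text


if __name__ == "__main__":
    sample = "This is a sen-\r\ntence that\r\nwraps.\r\n\r\nNext para-\r\ngraph."
    print(unwrap(sample))
```

Two caveats: this only helps when paragraphs are separated by blank lines, which is exactly what is missing in the worst cases, and the hyphen rule will also merge genuine compounds that happen to break at a line end (for example "well-" / "known" becomes "wellknown").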

So I was thinking about finding an additional solution to complement the OCR included in DTPO (which is fine most of the time), and ABBYY FineReader Professional was the first program that came to mind. But if there is no difference in the core OCR features between the Engine and the Professional version, that would just be a waste of money, as I don’t need any of the non-OCR-related features of ABBYY FineReader Professional, such as archiving.

By the way, the future release of DTPO that includes version 9 of the ABBYY engine will be a paid-for version of DTPO, will it not?

I’m not sure if the conversion to RTF is part of the ABBYY code or the DTPO code. All versions of DEVONthink can convert to PDF, so that would imply that this feature isn’t using the OCR engine. The OCR is used to add the text layer to the PDF, and then DEVONthink uses the text to create the RTF.

I do have FineReader Pro OCR 12.0.6, as well as Acrobat Pro XI, so I can try to see if either of them handle line endings better. Are there any PDFs online that show the problem you could point me to? I’ll test and send you results so you can decide if it’s worth it.

Thank you, Alan, for your generous offer!

I’ll send you a link to a pdf as a private message.

I’ve responded with results as a PM, including a link to output files. But it’s in my Outbox, not Sent Messages… Not sure what that is about.

For other interested parties, here are my results.

I tried exporting to Word and PDF from Acrobat Pro XI, PDFPen Pro 6.1.3, and FineReader 12.

Here are the output files for you to look at. Acrobat did not handle hyphens or page numbers at all, so they’re in the output. Disappointing.

PDFPen Pro uses an online conversion service from OmniPage, and that seemed to do a decent job at removing things.

FineReader did it internally and definitely had the most configurability, allowing you to remove or keep hyphens, page numbers, line numbers, etc. I think it had the best output, but there were a couple of things it messed up: a weird zeil 1 at the beginning, and it missed a hyphen somewhere. I had to run this one twice, because German wasn’t in my initial set of languages and the results were really bad. So you need to set the language beforehand.

This is very good proof of why, when planning to buy software, it is reasonable to check out not only the software itself but also its user community.

I did just that back in the day before I purchased DTPO, and I found this community very much alive and helpful.

Thank you, Alan, for spending your time and effort on finding a solution for my problem!

Is it possible to buy the latest ABBYY FineReader Pro and use it in DT instead of the engine in the current DT version?
I’m especially missing the Hebrew OCR that is available in the new ABBYY FineReader Pro version.

Thanks

You may use the ABBYY FineReader or other OCR software to process images for text recognition and save them to a Finder folder, from which they can be captured to a DEVONthink Pro Office database. This would be appropriate for languages that are not supported under our license from ABBYY.

In that case, you can add a Folder Action to the Finder folder to which OCRed files are saved. To do that:

  • In the Finder, select the folder and Control-click on it
  • Choose Services > Folder Action Setup
  • Choose DEVONthink - Import, OCR and Delete.scpt

As image files are saved to that folder, the script will result in OCR and capture to a DEVONthink database, followed by deletion of the original image file from the folder.

When ABBYY is eventually updated for the app, will it support OCR of Chinese, Japanese, and Korean?

I can’t make any promises as I don’t control what ABBYY does, but from what I’ve seen, I believe it will.