OCR triplicating document

simnew · February 1, 2020, 6:27pm

Hello! I’m a new and unskilled DT user, so this problem may be a result of my ignorance. Anyways, I need to import photographs of texts (JPG files) into a DT database and then use DT (of course, with the ABBYY add-on) to convert them into OCR’d PDFs. I’m doing this on a pretty massive scale, which is the whole reason I’m even using DT.

I just want to convert each JPG to an OCR’d PDF.

But I have run into what may be a bug with both of the ways I’ve tried to do this.

1) The easiest (but least automated) way is as follows: I select the file or files (JPG or JPGs) and then Data>OCR>to searchable PDF. When I do this, it creates three copies of the OCR’d PDF. (If I select Preferences>OCR>Move to Trash, then it deletes the original JPG file; if I don’t, then it makes the triplicate PDFs and leaves the JPG file in the same location.) Curiously, the OCR’d PDFs each have a slightly different file size.
2) I’ve made a Smart Rule configured as follows:

[Screenshot of the Smart Rule configuration: Screen Shot 2020-02-01 at 13.19.17]

This, however, leaves me with three copies of the file: the original JPG, a regular PDF, and an OCR’d PDF. Needless to say, I just want the last one! (I’ve tried to make another Smart Rule that I could run after this to delete the unwanted files, but so far to no avail.)

Obviously, I’d love to be able to automate this. I had hoped by using DT I could set it up so that I could drag and drop a folder into my DT database and have DT automatically convert all the JPG files nested within the files/groups into OCR’d PDFs. I guess I would settle for getting 1) above to just allow me to convert JPG to OCR’d PDF without triplicating or duplicating.

Help would be appreciated!

BLUEFROG · February 1, 2020, 7:07pm

You should always be specific about what you’re targeting, not leaving this section empty…

Here’s file 1. It’s the input.
Convert that file to a PDF (generates a new file). Now that’s the input since you used Convert & Continue. There’s file 2.
OCR that second file into a new PDF. That generates file 3.

Yep! It’s doing exactly what you told it to.

simnew · February 1, 2020, 7:15pm

Thanks for the comments. I am sure it is doing what I told it to, which is why I’m trying to tell it to do what I want it to do! Any ideas about that?

simnew · February 1, 2020, 7:19pm

This is the point^

No need for copies of all intermediate types of files (original JPG, non-OCR PDF), just start with JPG and end with OCR’d PDF.

BLUEFROG · February 1, 2020, 7:20pm

Yep - I was just building and testing.

This folder is in the Finder…

Dropped into a targeted group in DEVONthink, yields…

… with this smart group…

Note I am targeting only images, and in this case, ones I imported today.
The OCR > Apply action processes the file and doesn’t generate a secondary file.
The Play Sound is optional but I like it for debugging so I can hear it did something before I see it. Sometimes helpful.
It’s also helpful to move files out of a watched folder after the process. This is minimized by targeting files imported today, but you may want to consider adding a Move action at the end.
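
To make the file counts concrete, here is a toy Python model of the two rule designs discussed in this thread. It is only a sketch: `convert_and_continue`, `ocr_to_new_pdf`, and `ocr_apply` are hypothetical names standing in for the smart rule actions, not DEVONthink APIs, and the " (OCR)" suffix is invented for illustration.

```python
# Toy model only: these functions stand in for smart rule actions
# and are NOT DEVONthink APIs. The "database" is just a list of names.

def convert_and_continue(db, item):
    """Create a PDF copy of the item; the copy becomes the new input."""
    pdf = item.rsplit(".", 1)[0] + ".pdf"
    db.append(pdf)
    return pdf

def ocr_to_new_pdf(db, item):
    """OCR the input into a separate, new searchable PDF."""
    ocr = item.rsplit(".", 1)[0] + " (OCR).pdf"
    db.append(ocr)
    return ocr

def ocr_apply(db, item):
    """OCR the input in place: the item is replaced, no extra file appears."""
    ocr = item.rsplit(".", 1)[0] + ".pdf"
    db[db.index(item)] = ocr
    return ocr

def original_rule(name):
    """Convert & Continue followed by OCR into a new PDF."""
    db = [name]
    ocr_to_new_pdf(db, convert_and_continue(db, name))
    return db

def apply_rule(name):
    """A single in-place OCR step."""
    db = [name]
    ocr_apply(db, name)
    return db

print(original_rule("page1.jpg"))  # three entries: JPG, plain PDF, OCR'd PDF
print(apply_rule("page1.jpg"))     # one entry: the OCR'd PDF
```

The point of the model is that each action that generates a new file adds an entry, while an in-place action does not.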

Convert JPGs.dtSmartRule.zip (1.1 KB)

simnew · February 1, 2020, 7:26pm

Hurray! This seems like exactly what I need! Many thanks!

BLUEFROG · February 1, 2020, 7:28pm

You’re welcome.

One note: I wouldn’t dump thousands and thousands of files all at once. While it’s theoretically possible, processing in smaller batches is usually a better option.
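
If it helps, batching can be done in the Finder before anything is dropped into DEVONthink. Below is a minimal Python sketch, assuming the JPGs sit directly in one source folder (nested folders are not handled). The function name `split_into_batches`, the `batch_NNN` folder naming, and the default batch size of 200 are all arbitrary choices for illustration, not recommendations from DEVONthink.

```python
from pathlib import Path
import shutil

def split_into_batches(source, dest, batch_size=200):
    """Move the JPGs found directly in `source` into numbered subfolders of `dest`.

    Returns the list of batch folders that were created. Non-JPG files
    are left where they are.
    """
    source, dest = Path(source), Path(dest)
    images = sorted(
        p for p in source.iterdir()
        if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg"}
    )
    batches = []
    for start in range(0, len(images), batch_size):
        # batch_001, batch_002, ... (hypothetical naming scheme)
        folder = dest / f"batch_{start // batch_size + 1:03d}"
        folder.mkdir(parents=True, exist_ok=True)
        for image in images[start:start + batch_size]:
            shutil.move(str(image), str(folder / image.name))
        batches.append(folder)
    return batches
```

Each resulting `batch_NNN` folder can then be dropped into the targeted group one at a time, waiting for the smart rule to finish before adding the next.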

simnew · February 16, 2020, 5:07pm

While it usually works well, sometimes using this smart rule returns a curious result: rather than convert an image into an OCR’d PDF, it moves the image down and to the right, cropping some of the image out of the frame and adding a large white border to the top and left side. Again, usually it works perfectly. But for some files it does this.

I’ll give you an example. Here’s the JPG:

And here’s the OCR’d PDF:

I’ve tried introducing the JPG files different ways (moving them to a different database and then from there into the relevant database [so that the move triggering the conversion is internal to DT], moving images directly rather than as grouped by a folder in Finder), but the same result occurs. Again, it’s only a subset of the files. Maybe the JPGs have some metadata that somehow triggers DT or the ABBYY add-on to behave strangely (?).

Any idea how to resolve this?

BLUEFROG · February 16, 2020, 5:47pm

In DEVONthink, hold the Option key and select Help > Report Bug. ZIP and attach a few JPEGs that exhibit this issue. Thanks.

simnew · February 16, 2020, 7:04pm

I am unable to use that method for reporting a bug as it is linked to the Mail app, which I do not use. So here is the ZIP. OCR PDF conversion issue.zip (3.6 MB)

I have DT version 3.0.4.

BLUEFROG · February 17, 2020, 3:19pm

Interesting… it happens even when the image is converted to a TIFF beforehand.
@aedwards, any thoughts on this?

aedwards · February 17, 2020, 5:24pm

This is caused by an issue with ABBYY’s OCR and unfortunately there isn’t a workaround. We will release a fix for this once we have a fix from ABBYY.

simnew · February 29, 2020, 4:26am

Any progress on this?

aedwards · March 2, 2020, 9:02am

The update from ABBYY will be included as part of the v3.1 update that will be available in the next couple of months.

simnew · May 22, 2020, 11:59pm

I’ve updated to version 3.5, but this problem still occurs. For example:

As you can see, the right and bottom are still cut off and the white border on the top and left sides is still there, exactly as before. Still, as before, not every photo is treated this way. But still, as before, many are.

How may we proceed? This program is very expensive and it’s supposed to be capable of the simple task of converting photos into OCR’d PDFs. It’s been months… Is there something else I can do to get it to work correctly? What gives?