OCR triplicating document

Hello! I’m a new and unskilled DT user, so this problem may be a result of my ignorance. Anyways, I need to import photographs of texts (JPG files) into a DT database and then use DT (of course, with the ABBYY add-on) to convert them into OCR’d PDFs. I’m doing this on a pretty massive scale, which is the whole reason I’m even using DT.

I just want to convert each JPG to an OCR’d PDF.

But I have run into what may be a bug with both of the ways I’ve tried to do this.

  1. The easiest (but least automated) way is as follows: I select the file or files (JPG or JPGs) and then Data>OCR>to searchable PDF. When I do this, it creates three copies of the OCR’d PDF. (If I select Preferences>OCR>Move to Trash, then it deletes the original JPG file; if I don’t, then it makes the triplicate PDFs and leaves the JPG file in the same location.) Curiously, the three OCR’d PDFs also each have a slightly different file size.

  2. I’ve made a Smart Rule configured as follows:

    This, however, leaves me with three copies of the file–the original JPG, a regular PDF, and an OCR’d PDF. Needless to say, I just want the latter! (I’ve tried to make another Smart Rule that I could run after this to delete the unwanted files, but so far to no avail.)

Obviously, I’d love to be able to automate this. I had hoped by using DT I could set it up so that I could drag and drop a folder into my DT database and have DT automatically convert all the JPG files nested within the files/groups into OCR’d PDFs. I guess I would settle for getting 1) above to just allow me to convert JPG to OCR’d PDF without triplicating or duplicating.

Help would be appreciated!

You should always be specific about what you’re targeting, not leaving this section empty…

  1. Here’s file 1. It’s the input.
  2. Convert that file to a PDF (generates a new file). Now that’s the input since you used Convert & Continue. There’s file 2.
  3. OCR that second into a new PDF. That generates a file 3.

Yep! It’s doing exactly what you told it to. :stuck_out_tongue:

Thanks for the comments. I am sure it is doing what I told it to–which is why I’m trying to tell it to do what I want it to do! Any ideas about that?

This is the point^

No need for copies of all intermediate types of files (original JPG, non-OCR PDF), just start with JPG and end with OCR’d PDF.

Yep - I was just building and testing. :slight_smile:

This folder is in the Finder…

Dropped into a targeted group in DEVONthink, yields…


… with this smart rule…

  • Note I am targeting only images, and in this case, ones I imported today.
  • The OCR > Apply action processes the file and doesn’t generate a secondary file.
  • The Play Sound is optional but I like it for debugging so I can hear it did something before I see it. Sometimes helpful.
  • It’s also helpful to move files out of a watched folder after the process. This is minimized by targeting files imported today, but you may want to consider adding a Move action at the end.
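To make the difference between the two rules concrete, here is a toy model in Python. It is only a sketch of the bookkeeping described in this thread, not DEVONthink's API: `Record`, `convert_then_ocr`, and `ocr_apply` are hypothetical names, and the assumption that OCR > Apply turns the existing record into a searchable PDF is taken from the note above that it "doesn't generate a secondary file".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """Hypothetical stand-in for one item in a database."""
    name: str
    kind: str  # "jpg", "pdf" (no text layer) or "pdf+text" (OCR'd)


def convert_then_ocr(database: List[Record], source: Record) -> None:
    """Model of the first rule: every step creates a new record."""
    stem = source.name.rsplit(".", 1)[0]
    # Step 2: Convert & Continue produces a plain PDF next to the JPG.
    database.append(Record(f"{stem}.pdf", "pdf"))
    # Step 3: OCR to searchable PDF produces yet another file.
    database.append(Record(f"{stem} (OCR).pdf", "pdf+text"))


def ocr_apply(database: List[Record], source: Record) -> None:
    """Model of OCR > Apply: the existing record is processed in place."""
    stem = source.name.rsplit(".", 1)[0]
    source.name = f"{stem}.pdf"
    source.kind = "pdf+text"
    # Nothing is appended to the database here.


first = [Record("page.jpg", "jpg")]
convert_then_ocr(first, first[0])
print(len(first), [r.kind for r in first])

second = [Record("page.jpg", "jpg")]
ocr_apply(second, second[0])
print(len(second), [r.kind for r in second])
```

The first database ends up holding the JPG, a plain PDF, and an OCR'd PDF; the second holds a single OCR'd record, which is the behavior being asked for.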

Convert JPGs.dtSmartRule.zip (1.1 KB)

Hurray! :tada: This seems like exactly what I need! Many thanks!

You’re welcome.

One note: I wouldn’t dump thousands and thousands of files all at once. While it is theoretically possible, processing in smaller batches is usually a better option.
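If the photos live in a nested folder in the Finder, one way to work in smaller batches is to list the JPGs first and split that list into chunks before importing. The following is a minimal sketch in plain Python, independent of DEVONthink; the helper names `find_jpgs` and `in_batches` and the batch size are assumptions for illustration only.

```python
from pathlib import Path
from typing import Iterator, List

JPG_SUFFIXES = {".jpg", ".jpeg"}


def find_jpgs(root: Path) -> List[Path]:
    """Return every JPG below root (recursively), sorted for a stable order."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in JPG_SUFFIXES
    )


def in_batches(items: List[Path], size: int) -> Iterator[List[Path]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

For example, `for batch in in_batches(find_jpgs(Path("photos")), 200): ...` would hand you 200 files at a time (the folder name and the number are placeholders). Each batch can then be dragged or copied into the targeted group, letting the smart rule finish before the next batch goes in.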

While it usually works well, sometimes using this smart rule returns a curious result: rather than convert an image into an OCR’d PDF, it moves the image down and to the right, cropping some of the image out of the frame and adding a large white border to the top and left side. Again, usually it works perfectly. But for some files it does this.

I’ll give you an example. Here’s the JPG:

And here’s the OCR’d PDF:

I’ve tried introducing the JPG files different ways (moving them to a different database and then from there into the relevant database [so that the move triggering the conversion is internal to DT], moving images directly rather than as grouped by a folder in Finder), but the same result occurs. Again, it’s only a subset of the files. Maybe the JPGs have some metadata that somehow triggers DT or the ABBYY add-on to behave strangely (?).

Any idea how to resolve this?

In DEVONthink, hold the Option key and select Help > Report Bug. ZIP and attach a few JPEGs that exhibit this issue. Thanks.

I am unable to use that method for reporting a bug as it is linked to the Mail app, which I do not use. So here is the ZIP. OCR PDF conversion issue.zip (3.6 MB)

I have DT version 3.0.4.

Interesting… it happens even when the image is converted to a TIFF beforehand.
@aedwards, any thoughts on this?

This is caused by an issue with ABBYY’s OCR and unfortunately there isn’t a workaround. We will release a fix for this once we have a fix from ABBYY.

Any progress on this?

The update from ABBYY will be included as part of the v3.1 update that will be available in the next couple of months.

I’ve updated to version 3.5, but this problem still occurs. For example:

As you can see, the right and bottom are still cut off and the white border on the top and left sides is still there–exactly as it was behaving before. Still, as before, not every photo is treated this way. But still, as before, many are.

How may we proceed? This program is very expensive and it’s supposed to be capable of the simple task of converting photos into OCR’d PDFs. It’s been months… Is there something else I can do to get it to work correctly? What gives?

Some more examples:

Here are some images as an example, in both JPG and PDF.

example.zip (4.8 MB)

I have just tried this on the v3.5 build and I am getting the correct result.

Which version of macOS are you running?

These PDFs were made with FineReader v11. The OCR component in DEVONthink 3.5 is v12.
Check DEVONthink 3 > Install Add-Ons for an update.

@BLUEFROG that seems to work–thanks!