Hello! I’m a new and unskilled DT user, so this problem may be a result of my ignorance. Anyways, I need to import photographs of texts (JPG files) into a DT database and then use DT (of course, with the ABBYY add-on) to convert them into OCR’d PDFs. I’m doing this on a pretty massive scale, which is the whole reason I’m even using DT.
I just want to convert each JPG to an OCR’d PDF.
But I have run into what may be a bug with both of the ways I’ve tried to do this.
The easiest (but least automated) way is as follows: I select the file or files (JPG or JPGs) and then choose Data > OCR > to searchable PDF. When I do this, it creates three copies of the OCR’d PDF. (If I select Preferences > OCR > Move to Trash, it also deletes the original JPG file; if I don’t, it makes the three PDFs and leaves the JPG file in the same location.) Curiously, each of the OCR’d PDFs is a slightly different file size.
This, however, leaves me with three copies of the file: the original JPG, a regular PDF, and an OCR’d PDF. Needless to say, I just want the last one! (I’ve tried to make another Smart Rule that I could run after this to delete the unwanted files, but so far to no avail.)
Obviously, I’d love to be able to automate this. I had hoped by using DT I could set it up so that I could drag and drop a folder into my DT database and have DT automatically convert all the JPG files nested within the files/groups into OCR’d PDFs. I guess I would settle for getting 1) above to just allow me to convert JPG to OCR’d PDF without triplicating or duplicating.
Note I am targeting only images, and in this case, ones I imported today.
The OCR > Apply action processes the file and doesn’t generate a secondary file.
The Play Sound is optional but I like it for debugging so I can hear it did something before I see it. Sometimes helpful.
It’s also helpful to move files out of a watched folder after processing. Targeting only files imported today reduces the need for this, but you may want to consider adding a Move action at the end.
One note: I wouldn’t dump thousands and thousands of files in all at once. While it is theoretically possible, processing in smaller batches is usually a better option.
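If you want to stage the files in batches before dropping them into the watched folder, the splitting can be sketched in a few lines of Python. This is only an illustration: the function name `batches` and the batch size are made up here, and nothing in it calls DEVONthink itself.

```python
from typing import Iterator, List, Sequence


def batches(paths: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` paths (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        # The last chunk may be shorter than `size`.
        yield list(paths[start:start + size])


if __name__ == "__main__":
    # Example file names are placeholders, not real files.
    example = [f"scan_{n:04d}.jpg" for n in range(1, 8)]
    for number, chunk in enumerate(batches(example, 3), start=1):
        print(f"batch {number}: {chunk}")
```

Each chunk could then be copied into the watched folder and left to finish before the next one goes in.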
While it usually works well, sometimes using this smart rule returns a curious result: rather than convert an image into an OCR’d PDF, it moves the image down and to the right, cropping some of the image out of the frame and adding a large white border to the top and left side. Again, usually it works perfectly. But for some files it does this.
I’ve tried introducing the JPG files different ways (moving them to a different database and then from there into the relevant database [so that the move triggering the conversion is internal to DT], moving images directly rather than as grouped by a folder in Finder), but the same result occurs. Again, it’s only a subset of the files. Maybe the JPGs have some metadata that somehow triggers DT or the ABBYY add-on to behave strangely (?).
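One way to test the metadata guess would be to compare a single field between the files that convert correctly and the ones that don't. The sketch below reads the EXIF Orientation tag, picked only as an example of metadata that image tools are known to handle inconsistently; nothing in this thread confirms it is the cause. It is plain Python using only the standard library, the function names are made up for illustration, and it does not talk to DT or ABBYY.

```python
import struct
from typing import Optional


def _orientation_from_tiff(tiff: bytes) -> Optional[int]:
    """Scan the first IFD of an EXIF TIFF block for the Orientation tag (0x0112)."""
    if len(tiff) < 8:
        return None
    if tiff[:2] == b"II":
        endian = "<"  # little-endian
    elif tiff[:2] == b"MM":
        endian = ">"  # big-endian
    else:
        return None
    (ifd_offset,) = struct.unpack(endian + "I", tiff[4:8])
    if ifd_offset + 2 > len(tiff):
        return None
    (count,) = struct.unpack(endian + "H", tiff[ifd_offset:ifd_offset + 2])
    for i in range(count):
        start = ifd_offset + 2 + 12 * i
        entry = tiff[start:start + 12]
        if len(entry) < 12:
            return None
        (tag,) = struct.unpack(endian + "H", entry[:2])
        if tag == 0x0112:
            # Orientation is a SHORT stored in the first 2 bytes of the value field.
            return struct.unpack(endian + "H", entry[8:10])[0]
    return None


def exif_orientation(data: bytes) -> Optional[int]:
    """Return the EXIF Orientation value from JPEG bytes, or None if not found."""
    if data[:2] != b"\xff\xd8":  # not a JPEG
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):  # image data or end of file: stop looking
            return None
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segment = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and segment[:6] == b"Exif\x00\x00":
            return _orientation_from_tiff(segment[6:])
        pos += 2 + length
    return None
```

Calling `exif_orientation(pathlib.Path(p).read_bytes())` on a few good and a few bad files would show whether the value differs between the two groups. If it doesn't, the same comparison would need to be repeated with other fields.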
I am unable to use that method for reporting a bug as it is linked to the Mail app, which I do not use. So here is the ZIP. OCR PDF conversion issue.zip (3.6 MB)
As you can see, the right and bottom are still cut off and the white border on the top and left sides is still there, exactly as before. Still, as before, not every photo is treated this way. But still, as before, many are.
How may we proceed? This program is very expensive and it’s supposed to be capable of the simple task of converting photos into OCR’d PDFs. It’s been months… Is there something else I can do to get it to work correctly? What gives?