DT3b2 Smart rule issue with OCR

ngan · May 23, 2019, 10:47pm

I set up a smart rule to OCR whatever is scanned and saved by my network scanner into an indexed group. My OCR preference “Move to Trash” is enabled. This setup worked normally in DT3b1.

Now in DT3b2, after the OCR, the original file is deleted from the indexed folder (as seen in Finder), but the indexed group is left with an item showing as missing.

At the same time, the pdf+text file is automatically moved into the Global Inbox.

Thanks in advance

lutefish · May 23, 2019, 11:26pm

I was just about to post this bug - the new OCR file appears in the Global Inbox, but is deleted from the indexed folder.

Also, there is an error with the OCR: the file was a single page of text in landscape but was not identified as such, and ABBYY FineReader has (correctly) flipped the image but cropped out content.

Waleys 16-22-01-008.pdf (328.8 KB)

BLUEFROG · May 24, 2019, 2:43am

If you have the option to delete the original, then logically the original - the indexed file in the Finder - would be deleted.

If you used OCR > Apply, the file would be converted in place and remain indexed.

ngan · May 24, 2019, 5:01am

Hi

The main issue is that, under DT3b2, the same smart rule automatically moves the OCRed file to the Global Inbox (which it shouldn’t). That’s why the original group is left with a missing file.

BLUEFROG · May 24, 2019, 2:28pm

@aedwards or @cgrunenberg would have to assess this.

ngan · May 24, 2019, 2:39pm

Thank you.

lutefish · May 29, 2019, 11:04pm

While this is being assessed, is there any way to move the newly created item back to where it came from? I’m trying to “Move” the new files back to the databases from which they’ve come, but I can’t figure out how to identify that (unless I create a smart rule for each database, and then run those separately). Thanks.

BLUEFROG · May 30, 2019, 12:10am

If you are running a smart rule, what did you define as your target - a specific group, all databases, …?

lutefish · May 30, 2019, 12:23am

“All databases.”

The event trigger was “on clipping” of kind “web archive” - I’m trying to convert Web archives I’ve clipped over the last few years to PDFs.

BLUEFROG · May 30, 2019, 12:32am

OCR wouldn’t convert webarchives to PDF?

lutefish · May 30, 2019, 12:49am

OCR wasn’t available as a direct option. Selecting a webarchive directly, and right-clicking and selecting “OCR->” from the menu shows only grayed-out options.

Right clicking it and “convert-> to PDF” (which I hadn’t tried before) in fact creates a new PDF+Text item. (So, the original issue with OCR in 3b2 moving items to the global inbox, which still happens, no longer needs to be solved for this particular rule).

So, I suppose 1) I don’t need to OCR webarchive files, merely apply a rule that converts webarchives to PDF+Text, but now I’m trying to figure out how to 2) set the Date Created and Date Modified back to those of the original document, and 3) move the webarchive to trash.

lutefish · May 30, 2019, 12:53am

(Oh, and how to copy over the original URL and metadata. This is no longer a bug report. sorry)

BLUEFROG · May 30, 2019, 1:25pm

Conversion from webarchive to PDF preserves the metadata, etc.

Note, this creates a paginated PDF.

It’s also possible to use the Execute Script > Embedded action (only using embedded for convenience here) with this code:

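-- Called by the smart rule with the records it matched
on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with theRecord in theRecords
			-- Render the record's source URL as a single-page (unpaginated) PDF
			create PDF document from (URL of theRecord as string) in (current group) without pagination
		end repeat
	end tell
end performSmartRule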
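For the follow-up questions earlier in the thread (restoring Date Created and Date Modified, keeping the URL, and trashing the webarchive), the script above could be extended along these lines. This is an untested sketch, not a confirmed solution: it assumes the new PDF should land in the same group as the webarchive (`parent 1 of theRecord`) and that the database exposes a `trash group` property, so check both names against the DEVONthink scripting dictionary in Script Editor before relying on them.

```applescript
on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with theRecord in theRecords
			-- Assumption: create the PDF in the same group as the source webarchive
			set theGroup to parent 1 of theRecord
			set thePDF to create PDF document from (URL of theRecord as string) in theGroup without pagination
			-- Only touch the original if the conversion actually produced a record
			if thePDF is not missing value then
				-- Carry the original dates and URL over to the new PDF
				set creation date of thePDF to (creation date of theRecord)
				set modification date of thePDF to (modification date of theRecord)
				set URL of thePDF to (URL of theRecord)
				-- Assumption: "trash group" is the trash of the record's own database
				move record theRecord to (trash group of (database of theRecord))
			end if
		end repeat
	end tell
end performSmartRule
```

Trying it first on a duplicate of one webarchive is a sensible precaution, since the original is moved to the trash as soon as a PDF record comes back.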

lutefish · May 30, 2019, 5:12pm

Bluefrog - many, many thanks. That works as expected, and I can tinker with it as necessary.

BLUEFROG · May 30, 2019, 5:29pm

Many, many welcomes back to you. Glad to help!