OCR failed to find file (It's there - look! There!)

Blanc · October 16, 2019, 10:30am

Minor point: I have just scanned a document, which was named automatically (by ScanSnap Home, I guess) and passed to DT3. DT3 is set to OCR on import (which invariably works without a flaw). The document was named 20191016122238_Spitzenmedizin., \ … nah am Menschen^^fHi \ rl and OCR failed with the log entry shown in the image. DT3 remained on “loading document”. The document was present in the global inbox. Manually triggering OCR caused the same failure. Once I renamed the document to newname, OCR could be triggered manually and proceeded without further ado. Is the name above being treated as a path, maybe? Or is one of the special characters tripping up DT?

This is a bit of a freak occurrence, so it probably shouldn’t be put too high on the priorities list. If, however, the name is being interpreted as a path, that could in theory (a little far-fetched though) lead to data loss.

BLUEFROG · October 16, 2019, 4:57pm

Why is the file named like that?

Blanc · October 17, 2019, 10:25am

Take your pick:

It’s Klingon-Welsh my friend, and I wondered whether you could pronounce it.
I reverse-engineered DT3 during my lunch break, searched for bugs, suspected this string might cause a hiccup, and am now challenging you to figure it out for yourself.
The marvel that is ScanSnap Home extracts bits of what it thinks is text, randomly strings them together and uses the resulting gibberish as a file name.
I just press the scan button, mate. The rest, well, they say it’s magic.

(ScanSnap Home Scan-Settings window)

aedwards · October 18, 2019, 4:26pm

The “\” character in the file name is the problem. The OCR is looking for a file called “rl.pdf” in a folder called “… nah am Menschen^^fHi”, which in turn is in a folder called “Spitzenmedizin.,”. Automatic naming can cause these unusable names, although it would be fun if it was Klingon-Welsh.
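To illustrate the explanation above, here is a minimal Python sketch of how a name containing backslashes falls apart into folder and file components when something treats “\” as a path separator. The function `split_as_path` is hypothetical and is not how DEVONthink or the ABBYY engine is implemented; the “.pdf” extension is assumed from the “rl.pdf” mentioned above.

```python
def split_as_path(name: str, sep: str = "\\") -> list[str]:
    """Split a file name on a separator, as a path parser would.

    Hypothetical helper for illustration only; surrounding spaces
    are stripped from each component.
    """
    return [part.strip() for part in name.split(sep)]


# The automatically generated name from the first post, with an
# assumed ".pdf" extension appended.
name = "20191016122238_Spitzenmedizin., \\ … nah am Menschen^^fHi \\ rl.pdf"

parts = split_as_path(name)
for depth, part in enumerate(parts):
    # Every component except the last is read as a folder.
    kind = "file" if depth == len(parts) - 1 else "folder"
    print(f"{kind}: {part!r}")
```

Under these assumptions the last component is “rl.pdf” and the middle component is the folder “… nah am Menschen^^fHi”, which matches the lookup described in the post.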

Blanc · October 19, 2019, 2:39pm

Thanks, that’s what I guessed. Is it necessary or useful for DT or the OCR engine to interpret the file name in this way? Can it be avoided? Although I admit the risk is small, in theory a file name which leads to a real file (think “/Documents/Important Files/Klingon-Welsh-Dictionary.pdf”) would lead to that file being OCR’d and then moved to the trash. Because that would be unexpected behaviour, it’s quite possible that it would go unnoticed, so that the original document would be lost. Again, I acknowledge the risk must be small. I am also completely out of my depth when I ask whether a file name being interpreted as a path rather than simply a name could be used to introduce malware…

Just thinking out loud here - feel free to say “uhh, no, and just turn off automatic naming if you’re worried”

aedwards · October 21, 2019, 7:46am

It is ScanSnap Home that is providing DEVONthink with the path, so there isn’t much we can do there. However, I will add additional checks for names that produce non-existent paths.
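As a rough idea of what such a check could look like, here is a short Python sketch that detects path separators in a name and replaces them before the name is handed on. Both `is_plain_name` and `sanitize_name` are hypothetical helpers, and the choice of “_” as the replacement is an assumption; nothing here reflects the actual check added to DEVONthink.

```python
import re


def is_plain_name(name: str) -> bool:
    """Return True if the name contains no path separator characters."""
    return not any(sep in name for sep in ("/", "\\"))


def sanitize_name(name: str, replacement: str = "_") -> str:
    """Replace each path separator, plus surrounding whitespace,
    with a harmless character so the name stays a single component."""
    return re.sub(r"\s*[\\/]\s*", replacement, name)


scanned = "Spitzenmedizin., \\ nah \\ rl"
if not is_plain_name(scanned):
    # The name would otherwise be read as nested folders.
    scanned = sanitize_name(scanned)
print(scanned)
```

Sanitizing the name rather than rejecting it keeps the import working without user intervention, at the cost of a slightly altered file name.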

aedwards · October 22, 2019, 12:12pm

I have added a workaround so that the ABBYY OCR is now happy to process files with names containing “.” or “\”, so in the next update you shouldn’t get the “OCR unsuccessful” error.

Blanc · October 22, 2019, 5:18pm

cheers Alan, you genius