OCR to database metadata?

I wonder if it is possible to OCR text to a database comments field rather than "Finder Comments", i.e. not in the file or filesystem but only in the database.

In general: yes. But why? If you have OCR'd text, it’s probably more than just a few words. Why would you want that as metadata in addition to the text layer that is already there anyway?

Like Data > OCR > to Comment?

Like Data > OCR > to Comment?

That actually uses the "Finder Comment", not the comment metadata field in the database.

in addition to the text layer that is already there anyway?

These aren’t PDFs; they are image files. Therefore, there is no text layer.
The idea was to OCR the image text into metadata for search and any text-based processing.

I can make a smart group to OCR to Finder comments, copy the Finder comment to the comments in metadata, and then have a script remove the Finder comments. But this is a rather long process and is probably slower than it needs to be.

You could use Apple’s Vision framework to do the OCR in a script. I posted some sample code here some months ago.

But Vision is not perfect by any means. It has trouble assembling lines reliably, for example.

Are you referring to the Document > Properties > Comment field?

The internal Finder comments value is actually unlimited and stored in the database; it’s just mirrored to the Finder comments in the filesystem too, which are limited though.

Anyway, the overhead of such a smart rule is actually negligible. The OCR itself takes most of the time.

@BLUEFROG

Are you referring to the Document > Properties > Comment field?

Yes

@cgrunenberg

it’s just mirrored to the Finder comments in the filesystem too which are limited though.

This is fine; I just don’t want them mirrored to the filesystem (I expect these are extended file attributes?).

The Finder actually still uses a proprietary storage (hidden .DS_Store files).

Why does it need to be that particular field?

Why does it need to be that particular field?

@BLUEFROG

It doesn’t need to be a specific field as such; it just needs to be a field that doesn’t have side effects, e.g. changing file-embedded metadata (I don’t think this happens in DEVONthink anyway), changing file attributes via the Finder Comments field, or creating files via annotation fields. Currently the only OCR options are "Finder Comments" or annotation fields.

The Finder actually still uses a proprietary storage (hidden .DS_Store files).

@cgrunenberg Yes, but it depends on the filesystem. A modern macOS system uses APFS, which supports xattr (extended attributes). .DS_Store files can still be used in archives (although many archive formats now support xattr, and other than for backups they aren’t even used for basic archives), on old filesystems that don’t have extended attributes, and on non-Apple filesystems. .DS_Store files can also be used on network shares; however, it is common to configure macOS clients so that .DS_Store files are not created on network shares.

Actually, I can’t confirm this. Earlier today I created a new file on an APFS volume, added a Finder comment directly in the Finder on Monterey, and checked the extended attributes afterwards. No comment. Ironically, there is an extended attribute when using the Ventura beta, but it’s not the first macOS beta that does this (and then Apple dropped it again).

@cgrunenberg I stand corrected then. I know .DS_Store files still exist for Finder settings, but I would have thought a comment would be an extended attribute under APFS using com.apple.FinderInfo. So does the current Ventura use extended attributes?

But I don’t see why this should matter to the issue that I am having?
I just would like to store the OCR of image files into a field that is in the database only.

@cgrunenberg

Actually, my Mac on macOS Monterey 12.6 does:

Finder comment is “This is a test comment”

/usr/bin/xattr -l relNotes\(1\).html                                                                   
com.apple.macl:
com.apple.metadata:kMDItemFinderComment: bplist00_This is a test comment
com.apple.metadata:kMDItemWhereFroms: bplist00?_Fhttps://dl.dropboxusercontent.com/s/XXXXXXXXXX/relNotes.html?dl=0P

But it also creates a .DS_Store file.
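
For anyone who wants to check this on their own files, here is a minimal Python sketch that reads the same attribute and decodes it. It assumes the value is a binary property list (which the `bplist00` prefix in the output above suggests) and that `/usr/bin/xattr` is available as on the Mac above. The function names `decode_finder_comment` and `read_finder_comment` are my own, not part of any tool, and this only covers the extended attribute, not whatever the Finder keeps in .DS_Store.

```python
import plistlib
import subprocess
from typing import Optional

# Attribute name as shown in the `xattr -l` output above.
FINDER_COMMENT_ATTR = "com.apple.metadata:kMDItemFinderComment"


def decode_finder_comment(hex_dump: str) -> str:
    """Decode the hex output of `xattr -px` into the comment text.

    Assumption: the attribute value is a binary plist holding a string.
    """
    raw = bytes.fromhex(hex_dump)  # whitespace between hex bytes is skipped
    value = plistlib.loads(raw)    # plistlib detects the binary format itself
    return value if isinstance(value, str) else str(value)


def read_finder_comment(path: str) -> Optional[str]:
    """Return the Finder comment stored as an extended attribute, or None."""
    result = subprocess.run(
        ["/usr/bin/xattr", "-px", FINDER_COMMENT_ATTR, path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # xattr exits non-zero when the attribute (or the file) is missing.
        return None
    return decode_finder_comment(result.stdout)


if __name__ == "__main__":
    import sys

    for file_path in sys.argv[1:]:
        print(file_path, "->", read_finder_comment(file_path))
```

If this returns `None` for a file that shows a comment in the Finder, the comment is presumably only in the Finder's own storage on that system.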

It doesn’t but you asked this:

:slight_smile:

I did, didn’t I? :man_facepalming::stuck_out_tongue_winking_eye:

Hang on… why aren’t you just OCR’ing to a searchable PDF instead of leaving the file as an image?