Extracting date from content (yes, I know, everyone has posted about this)

I am trying to extract the date (or even a date) from an OCR’d document and place it into a custom metadata field called “date found” (meaning found in the document).
The smart rule that I created can do this somewhat, but I would like it to exclude documents whose “date found” has already been filled in.
Why will this smart rule not exclude documents whose “date found” is already filled in?

You could use a construct like this…

“Does not contain” is an exact string query (just like “is”, “is not”, “begins with”, “ends with”, etc.). Only “matches” supports operators and wildcards. Therefore, you could simply use “Date Found is not” and leave the text field empty.
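
As a rough analogy only, the difference between an exact string test and a wildcard test can be sketched in Python. This is not DEVONthink's implementation; the function names `contains_exact`, `matches_wildcard`, and `is_empty` are hypothetical, and the wildcard behaviour is borrowed from Python's `fnmatch` module, which may differ from DEVONthink's in detail.

```python
from fnmatch import fnmatchcase


def contains_exact(value: str, needle: str) -> bool:
    """Literal substring test: '*' and '?' have no special meaning here."""
    return needle.lower() in value.lower()


def matches_wildcard(value: str, pattern: str) -> bool:
    """Wildcard test: '*' matches any run of characters, '?' exactly one."""
    return fnmatchcase(value.lower(), pattern.lower())


def is_empty(value: str) -> bool:
    """True when the field holds nothing but whitespace."""
    return value.strip() == ""


if __name__ == "__main__":
    # The asterisk is treated as a literal character by the exact test.
    print(contains_exact("January 5, 2021", "jan*"))
    # The same pattern works once wildcards are interpreted.
    print(matches_wildcard("January", "jan*"))
    # An unfilled field is the case the rule should still act on.
    print(is_empty(""))
```

In this sketch, a pattern such as `jan*` only behaves as a wildcard in the second function, which mirrors the point that only “matches” interprets operators and wildcards.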

I tried the “does not contain”/empty text field method, but it did not work.

With Blue’s suggestion: Why am I modifying the Author field?
I changed the query to “Date Found” Is [empty text field], and it worked!
Still curious about Blue’s suggestion about the Author, though.

I think that was just meant as an example, not for you to use the Author field.

That is correct.

I just tested Blue’s method with the “Date Found” field, and it worked perfectly. So my mistake was not using “matches”? Is that the operator to use when I am trying to define a rule broadly?

Not necessarily. It really depends on the specifics of the situation.

I am groaning from the pain of this. I have been doing Boolean searches of text for 36 years. This is killing me. I have been working on this for 10-15 hours at this point.

Here is the newest, nonfunctioning smart group. It finds the documents – i.e., I can see the matches in the Search Inspector; but I cannot get the group to rename the Date Found field (nor to bounce the dock icon, nor anything else):

I had a version of this that successfully inserted into the Date Found field only the month name (\1), but not the date (\2) or year (\3), for some reason; and now I cannot even recreate that, because I have been through so many iterations of this that I no longer know which one worked and which did not.

(jan*|feb*|mar|march|apr*|may|jun*|jul*|aug*|sept*|oct*|nov*|dec|december) & (\d{2}?) & (20??)

(jan*|feb*|mar|march|apr|april|may|jun*|jul*|aug*|sept*|oct*|nov*|dec|december) NEAR/4 ([0-9]{1,2}) NEAR/4 (20??)

(jan*|feb*|mar|march|apr|april|may|jun*|jul*|aug*|sept*|oct*|nov*|dec|december) (\S|\h|\W|\b?)(\d{1,2}?)(\S|\h|\W|\b?)(20??)
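
For comparison, here is a minimal sketch of a three-group date pattern written for Python's `re` module, not for DEVONthink's own search or scan syntax. The names `MONTHS`, `DATE_RE`, and `find_date` are made up for illustration, and the pattern assumes dates written like "March 5, 2021" or "Sept. 3, 2020". One thing it may help to show: if the attempts above are evaluated as regular expressions rather than as search queries, `jan*` means "ja" followed by any number of "n" characters, and `20??` means "2" optionally followed by "0", so neither would match a full month name or a four-digit year.

```python
import re

# Hypothetical pattern: each month may be abbreviated or spelled out.
MONTHS = (
    r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|"
    r"jul(?:y)?|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|"
    r"nov(?:ember)?|dec(?:ember)?"
)

# Group 1: month name, group 2: day (1-2 digits), group 3: year (20xx).
# An optional period, ordinal suffix, and comma are allowed but not captured.
DATE_RE = re.compile(
    rf"\b({MONTHS})\.?\s+(\d{{1,2}})(?:st|nd|rd|th)?,?\s+(20\d{{2}})\b",
    re.IGNORECASE,
)


def find_date(text):
    """Return (month, day, year) for the first date found, or None."""
    match = DATE_RE.search(text)
    return match.groups() if match else None


if __name__ == "__main__":
    print(find_date("Invoice dated March 5, 2021 for services rendered"))
    print(find_date("Received Sept. 3, 2020"))
    print(find_date("No date in this sentence"))
    # Regex reading of the wildcard-style fragments used in the attempts:
    print(re.fullmatch(r"jan*", "january"))  # None: '*' repeats the 'n' only
    print(re.fullmatch(r"20??", "2021"))     # None: matches only '2' or '20'
```

The three capture groups correspond to the \1, \2, and \3 placeholders mentioned above; whether DEVONthink accepts this exact pattern in a smart rule is an assumption that would need testing.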