Tilde operator for hunting historical names... and a Soundex script?

Tilde

I find the tilde (~) operator so useful in my genealogy/historical research databases. You know how it is, with your ancestor named Benjamin Haskins born in 1751 whose name could be written or transcribed in a variety of forms including Benj^n Hoskin.

Say hello to the Boolean tilde operator! If I want to search for my elusive Ben in documents and annotations in my database, including thousand-page searchable PDFs downloaded from archive.org or FamilySearch’s Digital Library, I can easily do so with a Smart Group defined by All matches ~Ben NEAR/2 ~H?skin, which looks for words that contain Ben within two words of words that contain H?skin. (The ? stands for any one character, so Haskins and Hoskin both contain H?skin and are included in the search results.)
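For anyone who wants to see the proximity idea spelled out, here is a minimal Python sketch of what a search like ~Ben NEAR/2 ~H?skin is described as doing. It is not how the application implements its search. The helper names `contains_pattern` and `near` are made up for illustration, the sample sentences are invented, and I'm assuming that "within two words" means the word positions differ by at most two, in either order, and that matching ignores case.

```python
import re

def contains_pattern(term):
    """Regex for 'word contains term', where ? stands for exactly one character."""
    body = "".join("." if ch == "?" else re.escape(ch) for ch in term)
    return re.compile(body, re.IGNORECASE)  # case-insensitivity is an assumption

def near(text, term_a, term_b, distance=2):
    """True if a word containing term_a is within `distance` words of one containing term_b."""
    words = text.split()  # naive whitespace tokenizer (an assumption)
    pattern_a = contains_pattern(term_a)
    pattern_b = contains_pattern(term_b)
    hits_a = [i for i, w in enumerate(words) if pattern_a.search(w)]
    hits_b = [i for i, w in enumerate(words) if pattern_b.search(w)]
    return any(i != j and abs(i - j) <= distance for i in hits_a for j in hits_b)

# Invented sample text, only to exercise the sketch.
print(near("Benj^n Hoskin, farmer", "Ben", "H?skin"))                    # True
print(near("Haskins, Benjamin, born 1751", "Ben", "H?skin"))             # True
print(near("Benjamin was the son of William Haskins", "Ben", "H?skin"))  # False
```

The last sentence fails only because the two names are six words apart; widening the distance would catch it at the cost of more false positives.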

This also works for my ancestors with surname Van Houten/Vanhouten/Hooten/Hotten: I just need to search for ~ho?ten, and I don’t need to think about the extensive list of names I’d need to use in an OR statement.

If needed, though, you can combine tilde operator search terms in OR (or other) statements, like for the impossible-to-find-all-variations-of Phettiplace. One census enumerator recorded the family as Pplace. In many other records/books, the surname is spelled with either P or Ph or F at the beginning, followed by I or E, with one or two Ts in the middle and either I or Y or E after that. Instead of just doing ~place, which would result in far too many false positives (e.g., commonplace, placemat), I could do something like (Pplace OR ~?tt?place OR ~?t?place). This OR statement would catch Fhitteplace under ~?tt?place and Petyplace under ~?t?place. (If there is a way to use an operator to search for only one or two characters, I’d love to know what that is. I’ve always understood ? to search for only one character and ?? to search for only two characters.)
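Here is a small Python sketch of how the OR statement above sorts a handful of spellings. It only imitates the behaviour described in this post (a tilde term matches any word containing it, and ? stands for exactly one character); it is not the application's own search code. The names `contains_pattern` and `matches` are hypothetical, and case-insensitive matching is an assumption.

```python
import re

def contains_pattern(term):
    """Regex for 'word contains term', where ? stands for exactly one character."""
    body = "".join("." if ch == "?" else re.escape(ch) for ch in term)
    return re.compile(body, re.IGNORECASE)  # case-insensitivity is an assumption

def matches(word, exact=("Pplace",), tilde_terms=("?tt?place", "?t?place")):
    """Imitate (Pplace OR ~?tt?place OR ~?t?place) for a single word."""
    if word.lower() in {e.lower() for e in exact}:
        return True
    return any(contains_pattern(t).search(word) for t in tilde_terms)

for word in ["Pplace", "Phettiplace", "Fhitteplace", "Petyplace",
             "commonplace", "placemat"]:
    print(f"{word}: {matches(word)}")
# Pplace: True
# Phettiplace: True
# Fhitteplace: True
# Petyplace: True
# commonplace: False
# placemat: False
```

The two tilde terms are needed precisely because ? never matches zero characters: ?tt?place covers the double-T spellings and ?t?place the single-T ones.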

You will definitely get some false positives, but for the most part it is easier to weed those out than to track down the documents a narrower search would have missed.

Soundex

Do any of you master script writers think it would be possible to create a search script based on Soundex? (There’s also a German Soundex variant; I have a PHP script for it written by Nicholas Zimmer, but I can no longer find it online to link to.)

The soundex page above provides an in-depth explanation of the system, but in brief, each consonant except H and W is assigned to one of the numerals 1-6 (the vowels A, E, I, O, U, along with H, W, and Y, are not coded):

Number   Represents the Letters
1        B, F, P, V
2        C, G, J, K, Q, S, X, Z
3        D, T
4        L
5        M, N
6        R

So, the elusive Mr. Benjamin Haskins could be represented as H-252. Other names coded H-252 include Higgins and Hawkins. (This online soundex calculator works but doesn’t correctly apply the HW rule described on NARA’s page above. There are others out there, but most of them ignore the HW rule. If you are looking for one, test it with “Ashcraft”: it should give you A-261, not A-226, because the S and C share the code 2 and are separated only by an H, so they are coded once.) (I previously wrote my own soundex script for a PHP website I had a long time ago; it correctly incorporates the HW rule and the others. I’ve never learned AppleScript, nor is it on my to-do list, so translating the script myself is not a viable option. However, I can share it if someone wants to take a crack at making an AppleScript version.)
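To show how little code the coding rules need, here is a minimal Python sketch of American Soundex including the HW rule. It is not a translation of the PHP script mentioned above (which I am not reproducing here), and it is not an AppleScript or JavaScript version ready to drop into the application. The function name `soundex` and the hyphenated output format are my own choices for illustration; the sketch assumes plain A-Z names and silently drops any other characters.

```python
# Letter groups from the table above; vowels, H, W, and Y have no code.
GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}

def soundex(name):
    """Return a code like 'H-252': first letter plus three digits, zero-padded."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's own code still counts
    for ch in letters[1:]:
        if ch in "HW":
            continue  # HW rule: H and W do not separate same-coded consonants
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel (or Y) resets prev, so a repeat after it is coded again
    return first + "-" + "".join(digits)[:3].ljust(3, "0")

for name in ["Haskins", "Higgins", "Hawkins", "Ashcraft", "Hoskin"]:
    print(name, soundex(name))
# Haskins H-252
# Higgins H-252
# Hawkins H-252
# Ashcraft A-261
# Hoskin H-250
```

Note that Hoskin, without the final S, codes to H-250 rather than H-252, so a Soundex search and a wildcard search will not always agree on which spellings belong together.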

If a search script is possible, would it also then be possible to call it using a character like the tilde?

Cheers.

Soundex would be great. There’s a Perl implementation but I’ve never tried using it.

JavaScript implementation for Soundex here: Soundex in JavaScript · GitHub
And in other languages: StackPath
Understandably not in AppleScript, though.
Since DT understands JavaScript, you could try to use that.
Since the Soundex algorithm seems to be geared exclusively to one language and script, its usefulness is quite limited, I think (French? Russian? Mandarin?). Shouldn’t there be better methods by now? (The method is about a hundred years old!)

Someone created a German Soundex quite a few years ago. Nick somebody, I think.

One down, 5998 to go
