Inconsistent search results related to Unicode normalization

somelinguist · December 23, 2022, 4:19pm

Hi. There seems to be some strange behavior with Unicode normalization and searches depending on where the original search is typed.

I tried searching for equivalent versions of Unicode characters with diacritics. For example, the letter é can be represented either by one code point (U+00E9) or by two (U+0065 and U+0301).
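For anyone following along, here is a minimal sketch of what "equivalent versions" means, using Python's standard `unicodedata` module. It only illustrates the Unicode behavior; the variable names are mine and it says nothing about how DEVONthink handles this internally.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e (U+0065) followed by combining acute accent (U+0301)

# The two strings look the same on screen but differ code point by code point.
print(precomposed == decomposed)           # False
print(len(precomposed), len(decomposed))   # 1 2

# Normalizing both to the same form (NFC here) makes them compare equal.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                      # True
```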

When searching for the single code point version, DEVONthink always correctly returns results for both the single code point and two code point versions: the correct file results are listed, and selecting a file lists the occurrences in the search inspector and highlights them in the preview.

However, when typing the two code point version, the behavior varies depending on where the search was originally typed.

If the two code point version is typed in the main search box in the application toolbar, the behavior is the same as for the single code point version, as described above.

However, if the two code point version is originally typed in a criteria editor text box in an advanced search, the search returns the correct file results, but fails to list individual occurrences in the search inspector and also fails to highlight them in the preview.

Placing the cursor in the search inspector search box and pressing Return afterward does list the individual occurrences in the selected file and also highlights them in the preview.

It would be preferable if the behavior were always the same as when typed from the main search bar, so that equivalent searches returned the correct results.

I’ve attached a test file that can be used to demonstrate the differences.
Unicode normalization search.pdf (48.5 KB)

Thanks for your help.

BLUEFROG · December 23, 2022, 6:10pm

Thanks for the report!
We will investigate this in the near future.

somelinguist · December 24, 2022, 2:27am

Thanks! Hope you all have time for vacation!

cgrunenberg · January 4, 2023, 2:28pm

Internally DEVONthink should always use the same normalized representation and therefore the results should be identical. Which version of DEVONthink & macOS, which system language and what kind of keyboard do you use?

somelinguist · January 4, 2023, 7:22pm

Hi, Thanks for writing back.

I’m using DEVONthink Pro 3.8.7 on macOS Ventura 13.1.

I do think the actual underlying results are the same regardless of which form is used. The issue is more about the occurrences not being shown in the search inspector panel and not being highlighted in the preview when selecting a file from the search results.

Everything works as expected when either of the following is true:

The search was typed in the search text box in the main toolbar, regardless of whether the character was typed with the combining diacritic (two code points) or as the precomposed version (single code point).
The precomposed version (single code point) was typed in the criteria editor in the advanced search.

In each of the above cases, selecting a file in the search results displays the occurrences in that file in the search inspector panel and also highlights them in the preview pane. Here’s a screenshot immediately after selecting the file:

However, if the version with the combining diacritic (two code points) is typed in the criteria editor in the advanced search, the correct files are included in the search results, but no occurrences are immediately shown in the search inspector panel or highlighted in the preview pane. Here is a screenshot immediately after performing the search from the criteria editor in the advanced search and then selecting the file in the results:

After selecting the file, if I place the cursor in the search inspector text field and type return, then the occurrences are shown and highlighted correctly, as pictured here:

So again, I think the underlying results are correct. But it seems that nothing triggers the search inspector to show the occurrences immediately when the two code point version is typed in the criteria editor in the advanced search panel.

One of the keyboards I’m using to type the two code point version is available here:

https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=ipa-sil_keyboard

It’s the first download on the page, listed as IPA Unicode 6.2 Macintosh Keyboard v1.5

To type the í in the example, type i then @ then 3. The é in the other word can be typed the same way, replacing the i with e.

The precomposed (single code point) versions can be typed with any standard Spanish or European keyboard.

Thanks again for your help.

cgrunenberg · January 5, 2023, 7:26am

The issue definitely seems to be related to this keyboard, as I couldn’t reproduce this case using the default one. We’ll check this, thanks for your help.

cgrunenberg · January 5, 2023, 11:12am

The website seems to block certain countries, unless it’s just a temporary server issue:

somelinguist · January 5, 2023, 5:04pm

Hi, Thanks for the info.

It makes sense to me that you can’t reproduce it with the default keyboard, as most keyboards for Latin scripts would only be able to type the precomposed, single code point version of the character.

I do believe it is a Unicode issue, however, as I get the same results when typing with other keyboards that are capable of typing the two code point version with combining diacritics. It also happens when copying and pasting text containing the two code point versions from other programs.
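If it helps with reproducing this without a special keyboard, here is a small sketch for checking which form a typed or pasted string is in. It assumes Python's standard `unicodedata` module; the helper names `is_precomposed` and `has_combining_marks` are just ones I made up for illustration.

```python
import unicodedata

def is_precomposed(text: str) -> bool:
    """Return True if the text is already in NFC (composed) form."""
    return unicodedata.normalize("NFC", text) == text

def has_combining_marks(text: str) -> bool:
    """Return True if any code point in the text is a combining mark."""
    return any(unicodedata.combining(ch) != 0 for ch in text)

two_code_points = "comi\u0301"   # i + U+0301, as typed with the IPA keyboard
single_code_point = "com\u00ed"  # precomposed U+00ED

print(is_precomposed(two_code_points), has_combining_marks(two_code_points))
print(is_precomposed(single_code_point), has_combining_marks(single_code_point))
```

The first string should report as not precomposed and as containing a combining mark; the second should report the opposite.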

Again, it seems like the underlying search is correctly doing the Unicode normalization, because the file results are correct, and when it finally highlights after hitting return in the search inspector, the highlighted occurrences are correct.

There just seems to be something that doesn’t make the search inspector automatically list and highlight occurrences when the search for the two code point version is typed in a criteria editor.

That’s strange about the website. I just tried downloading it from there this morning and got the same result.

Here is the same file that I downloaded from there in July 2020: IPA-MACkbd.dmg - Google Drive (I wasn’t able to upload it to the forum).

You can also try copying and pasting one of the words in the second section of the PDF I uploaded in the original post. Pasting one of those words into a search criteria editor and selecting the result file does not list or highlight the occurrences. Pasting it into the search text box in the main toolbar should list and highlight all occurrences in the file.

Thanks again for your help!

somelinguist · January 5, 2023, 5:08pm

I just investigated more, and I think I might have found where the issue occurs.

It looks like when the two code point version is typed in the search box in the main toolbar, the search term that gets copied to the search inspector DOES get normalized to the precomposed single code point version (copying the term out of the search inspector into CotEditor, a program that can describe the Unicode code points, shows that it has been changed to the single code point version).

In the case of comí, U+0069 LATIN SMALL LETTER I plus U+0301 COMBINING ACUTE ACCENT becomes U+00ED LATIN SMALL LETTER I WITH ACUTE.

However, it looks like when the two code point version is typed in a criteria editor, the search term that gets copied to the search inspector is NOT normalized (copying the term out of the search inspector into the other program shows that it retains the two code point version that was originally typed in the criteria editor).

In the case of comí, U+0069 LATIN SMALL LETTER I plus U+0301 COMBINING ACUTE ACCENT stays the same.

However, pressing Return after that in the search inspector DOES perform normalization to the precomposed version (copying and pasting the term from the inspector into the other program shows that it has been normalized to the single code point precomposed version).

In the case of comí, U+0069 LATIN SMALL LETTER I plus U+0301 COMBINING ACUTE ACCENT becomes U+00ED LATIN SMALL LETTER I WITH ACUTE.
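For reference, the same kind of code point inspection I did in CotEditor can be approximated with a few lines of Python. This is only a sketch for checking what a text field actually contains; the `describe` helper is a name I made up.

```python
import unicodedata

def describe(text: str) -> list[str]:
    """List each code point in the text with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

typed = "i\u0301"  # what the criteria editor passed along unchanged
normalized = unicodedata.normalize("NFC", typed)

print(describe(typed))
# ['U+0069 LATIN SMALL LETTER I', 'U+0301 COMBINING ACUTE ACCENT']
print(describe(normalized))
# ['U+00ED LATIN SMALL LETTER I WITH ACUTE']
```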

It would be preferable if typing the search from either place always performed normalization on the text that appears in the text box in the search inspector. I think that would solve the problem.
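To make the suggestion concrete, here is a rough sketch of the idea in Python. I have no knowledge of DEVONthink's actual code, so everything here is an assumption: that the inspector matches the term literally against normalized document text, and that one shared normalization step (the hypothetical `inspector_term`) could be applied no matter where the search was typed.

```python
import unicodedata

def find_occurrences(term: str, text: str) -> list[int]:
    """Return start offsets of literal (code point by code point) matches."""
    hits = []
    start = text.find(term)
    while start != -1:
        hits.append(start)
        start = text.find(term, start + 1)
    return hits

def inspector_term(raw: str) -> str:
    """Hypothetical shared entry point: normalize whatever the user typed."""
    return unicodedata.normalize("NFC", raw)

# Assumed: the document text is stored in precomposed (NFC) form.
document = "Ayer com\u00ed temprano y com\u00ed tarde."
typed = "comi\u0301"  # two code point version from the criteria editor

print(find_occurrences(typed, document))                  # []
print(find_occurrences(inspector_term(typed), document))  # [5, 21]
```

Under those assumptions, the un-normalized term finds nothing to list or highlight, while the normalized term finds both occurrences, which matches what I see before and after pressing Return in the inspector.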

cgrunenberg · January 6, 2023, 11:12am

Thanks for the keyboard layout, I was able to reproduce the issue now and the next maintenance release will fix this.

somelinguist · January 6, 2023, 2:08pm

Thanks so much!