DT3 to go - Number of “PDFs without Text”

ovicula · February 16, 2021, 8:16am

Hello everyone,

Maybe I’m just overlooking something, but here is what I observed:

I have about 3,300 PDF files in my database, all of them searchable via OCR (most were made searchable with ScanSnap Home or OCRKit).
When I created a smart group in DT 3 searching for PDF files with 0 characters, it listed about 10 files earlier today, and for those it was correct.

When I enable the “PDFs without Text” view in DT 3 to go, I am shown nearly 3,300 files.
If I then run OCR in iOS on a PDF that is already searchable, the number of “PDFs without Text” drops accordingly.

After I made those 10 PDF files searchable via OCR in DT 3 and ran a sync, the number of “PDFs without Text” in DT 3 to go also dropped by 10.

Am I misinterpreting something, or does DT 3 to go only count searchable PDFs that were made searchable with DT 3 (to go) and ignore searchable PDFs from other sources?

Bug, feature, or user error?

Blanc · February 16, 2021, 9:09am

This changelog entry, “The word count provided through a sync wasn’t always correctly stored and could lead to the document appearing in the PDFs Without Text smart group. Fixed.”, suggests to me that DTTG (also) relies on information from the Mac when classifying PDFs as “with text” or “without text”. However, I can confirm that it does not work (at least for me): both DTTG 3.0.1 and 3.0.2 list thousands of PDFs in the DTTG smart group “PDFs without Text”, yet I can select and copy text in those PDFs. So that looks like a bug (@eboehnisch).

eboehnisch · February 16, 2021, 10:07am

The word count is stored in the database so that smart groups can be built on it. I would trigger a metadata reindex. Copy the following link into Safari’s address bar and press Return:

x-devonthinktogo://reindex-metadata

This goes through all documents and also updates the character and word counts.

ovicula · February 16, 2021, 10:34am

I did that on both iOS devices, then quit and relaunched the apps. It had no effect on this problem.

gps2003 · February 16, 2021, 10:42am

That didn’t work for me either. Supposedly I have over 12,000 PDFs without text; in reality there are only a few.

eboehnisch · February 16, 2021, 12:50pm

Could you then try a full reindex?

x-devonthinktogo://reindex

gps2003 · February 16, 2021, 2:57pm

I’ve started it, but with 40,000 items it takes a while …

ovicula · February 16, 2021, 3:34pm

It works! Many thanks!

eboehnisch · February 16, 2021, 4:28pm

Ah, wonderful. Thanks for the feedback.

chrk · February 16, 2021, 5:17pm

I am seeing the same issue, but the reindexing (x-devonthinktogo://reindex) doesn’t solve it. I have 112 PDFs on DTTG listed as without text, but my smart group on Mac lists 0. I manually checked 10 of the ones listed in DTTG and they are all showing up as PDF+Text on Mac and also have text selectable in DTTG, without running OCR.

eboehnisch · February 16, 2021, 5:57pm

Could you please open a support ticket and, maybe, share one or two of these documents if they are not of very personal nature?

chrk · February 16, 2021, 6:11pm

Sure. I just tried via DTTG, but got an error because I don’t have Mail installed. Contacting via Mac would be fine I guess? Or should the PDFs be shared from DTTG?

eboehnisch · February 17, 2021, 4:56pm

Thank you, received. They all work fine here. Is the reindexing definitely completed? Are these PDFs findable in DEVONthink To Go when searching for content? And do you, by any chance, remember how long ago you added them? Just exploring possible explanations.

chrk · February 17, 2021, 6:32pm

Thanks Eric.

Yes, reindexing finished. I just did it again and saw it counting down to zero from 8000+ items.

Yesterday, I also deleted the app and reinstalled. I chose CloudKit sync, because the Mac finally finished that (I used iCloud legacy before). After the reinstall, the number of PDFs without text increased to 138.

After the reindexing today, it still shows 138 PDFs without text.

Text from the PDFs is found on Mac in DT with a standard search.

The documents were added to DT on Mac on 2020-10-01, 2020-12-09 and 2020-12-06.

Further tests in DTTG reveal some reproducible behavior.

Text in one of those listed PDFs can be selected and copied. Pasting the copied text into the search field within the PDF finds the text and shows the page as a result.

However, from the global search field – when the PDF is not opened in DTTG – that same copied text pasted into the global search and pressing enter reveals no results. Selecting a shorter phrase from the document, like a name and just 3-4 words instead of 1-2 sentences, finds the PDF.

This revealed that there might be an unrelated/additional issue in DTTG with the global search, especially when punctuation is part of the search string.
For example, searching the “Hypothesis” PDF I sent you for “Although the mechanisms involved in memory are still debated, they seem” reveals zero results from the global search, but inside the PDF, this phrase is found.
Now, when searching for “Although the mechanisms involved in memory are still debated they seem” or “Although the mechanisms involved in memory are still debated* they seem” (with an asterisk in place of the comma), the PDF is found via global search, indicating an issue with punctuation.
The same thing happens when searching for “University of Debrecen, Hungary” (no results) and “University of Debrecen* Hungary” (PDF is found).

Since this PDF was listed as without text in the smart group, I had to download it first in DTTG to test the things described above. Curiously, after doing this, I noticed that the number of PDFs without text decreased. Downloading all those files emptied the smart group, but the number remained.
Then, removing and re-adding the smart group also updated the number to 0.

Summary:

The entries in the PDFs without text smart group needed to be downloaded first for DTTG to recognize that they in fact contain text. Since I don’t download most of my files, there might be an issue with some documents in DTTG (<2% of all my files), falsely not being recognized as containing text.

Testing this revealed a possible issue with global search: it fails to find text that a search within the opened PDF does find. I can reproduce this when punctuation is part of the search string.

BLUEFROG · February 18, 2021, 7:58am

Only alphanumeric characters are indexed and searchable.

Blanc · February 18, 2021, 8:01am

Whilst that is ok, should the search not then ignore punctuation? (i.e. why search for something which cannot be found?)

eboehnisch · February 18, 2021, 10:18am

Interesting find about the comma. Technically, commas are not stored in the full-text index. We need to check whether we filter the comma also from the entered search term. @BLUEFROG, would you please open an issue for this? Thank you.
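For illustration only, here is a minimal Python sketch of the mismatch described above. It is not DEVONthink To Go’s actual code: the tokenizer, the toy index, and the document name `hypothesis.pdf` are assumptions made up for this example. It shows that if the index keeps only alphanumeric tokens, a query containing a comma can only match when the query is passed through the same filter.

```python
import re

def tokenize(text):
    """Split text into lowercase alphanumeric tokens; punctuation is dropped."""
    return re.findall(r"[^\W_]+", text.lower())

def build_index(docs):
    """Map each document name to its token list (a toy stand-in for a full-text index)."""
    return {name: tokenize(body) for name, body in docs.items()}

def phrase_search(index, query, normalize_query=True):
    """Return names of documents that contain the query as a consecutive token run."""
    # With normalize_query=False the query is only split on whitespace,
    # so a term like "debated," keeps its comma and cannot match the index.
    terms = tokenize(query) if normalize_query else query.lower().split()
    if not terms:
        return []
    n = len(terms)
    hits = []
    for name, tokens in index.items():
        if any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1)):
            hits.append(name)
    return hits

# Hypothetical document containing the phrase quoted in this thread.
docs = {
    "hypothesis.pdf": "Although the mechanisms involved in memory are still debated, they seem",
}
index = build_index(docs)
query = "still debated, they seem"

print(phrase_search(index, query, normalize_query=False))  # raw query, comma kept
print(phrase_search(index, query, normalize_query=True))   # query filtered like the index
```

In this sketch the raw query returns no hits, while the filtered query finds the document, which matches the behavior reported for the comma and the asterisk workaround.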

BLUEFROG · February 18, 2021, 10:33am

Done.

eboehnisch · February 18, 2021, 11:35am

That’s also an interesting bit of information. There is a specific situation where this could happen. We’re investigating. Thank you for testing this so thoroughly!

chrk · February 18, 2021, 1:33pm

Thanks for looking into that. It would make more sense to me if punctuation in the search term were ignored. Otherwise, searching for quotes in DTTG would never work.