Force re-scan indexed PDF in DTTG3b3?

rfog · June 24, 2019, 6:17am

Recently I removed and re-added a lot of PDFs (the same files). As sometimes happened in DT2, though much less often, some PDFs are recognized as “PDF Document” instead of “PDF+Text”. And no, it is not one of the macOS PDF sh*t bugs: DT can read these PDFs, because if I remove one and add it again, it is marked as “PDF+Text”.

Is there any way to force DT to recognize them without having to remove and re-add them? Currently I have 71 of these PDFs spread across a few dozen folders…

PS: for the curious, yes, it is the full National Geographic Magazine up to 2018, scanned and OCR’d from a zillion sources.

BLUEFROG · June 24, 2019, 12:59pm

No, but you could start a support ticket and attach a file to test.

rfog · June 24, 2019, 1:14pm

The thing is, I cannot attach anything, because it only happens when indexing some hundreds of PDFs, and the affected PDFs are random ones.

BLUEFROG · June 24, 2019, 3:22pm

PDFs are not created equally. In fact there are probably more bad PDFs in the universe than any other format.

rfog · June 24, 2019, 5:43pm

Of course, but I’m talking about the same PDF. During bulk indexing, the same PDF is sometimes indexed as having text and sometimes not. If you remove the offending PDF and then add the same file again, it is indexed as having text.

BLUEFROG · June 24, 2019, 6:20pm

Development would have to assess this. Sample problem files would be helpful.

rfog · June 25, 2019, 8:51pm

Sorry for the delay. A lot of work.

You can find some samples at the URL below. But as I said, which PDFs go unrecognized is random.

https://1drv.ms/u/s!ArnzyWtYu8jx2tZuqCc_XPL3pfFkng?e=UdusTJ

BLUEFROG · June 25, 2019, 11:36pm

I didn’t see any issue with those PDFs, other than them being very large. All detected as PDF+Text.

rfog · June 26, 2019, 9:05am

That is what I am trying to tell you. When indexing a large number of PDFs, some PDFs, at random, are marked as having no text even though they contain text.

These are good, valid PDFs, but they are not recognized as having text. To me, as a developer, it could be some kind of interlock issue between the queue that reads the PDFs and the queue that analyzes them, or a similar issue. In DT2 this happened now and then; in DT3 it happens more frequently. The difference between DT2 and DT3 is that DT3 now enqueues the PDFs instead of processing them one after another in a modal dialog. But of course this is a guess, and a low-priority issue.

cgrunenberg · June 28, 2019, 1:03pm

The indexing is performed by a background task so that third-party code, e.g. Spotlight importers or macOS, can’t crash or freeze DEVONthink. Therefore this might be a timeout issue, especially if DEVONthink is in the background, the documents are huge, and the active app uses a lot of CPU time.

Please choose Help > Report Bug while pressing the Alt modifier key and send the result to cgrunenberg - at - devon-technologies.com - thanks!
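To illustrate the timeout explanation above, here is a toy Python sketch. It is not DEVONthink's code. The names `extract_text` and `classify`, the timings, and the fallback rule are all assumptions made up for illustration. It only shows how the same file could be classified differently depending on whether a background extraction finishes before a deadline.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def extract_text(cost_seconds):
    # Hypothetical stand-in for a third-party text importer.
    # The sleep simulates a large file or a CPU-starved background task.
    time.sleep(cost_seconds)
    return "page text"


def classify(cost_seconds, timeout_seconds, pool):
    # Run the extraction in the background and wait only up to the deadline.
    future = pool.submit(extract_text, cost_seconds)
    try:
        text = future.result(timeout=timeout_seconds)
    except TimeoutError:
        # Assumed fallback: no text arrived in time, so treat it as text-less.
        return "PDF Document"
    return "PDF+Text" if text else "PDF Document"


with ThreadPoolExecutor(max_workers=1) as pool:
    # Same hypothetical file, two different load conditions.
    print(classify(0.01, 0.2, pool))  # extraction finishes well before the deadline
    print(classify(0.5, 0.2, pool))   # extraction is too slow for the deadline
```

Under these assumptions the outcome depends on system load rather than on the file itself, which would match the observation that removing and re-adding a single file classifies it correctly.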

prunk · June 23, 2023, 12:45am

I’m having the same issue. I’ve moved a large number of PDFs to another database and a bunch of them are showing up as “PDF Document” instead of “PDF+Text” (a lot without thumbnails as well), but they were “PDF+Text” in the previous database. They are re-indexed correctly if I move them out to Finder and import them again, but I don’t want to do this individually with dozens of files in different groups.

Was an option to force a re-index of those files ever implemented? I’m on version 3.9.1.

cgrunenberg · June 23, 2023, 6:17am

How did you actually move them? Directly via e.g. drag & drop or the Move To contextual menu? In this case the search index, metadata and type are all moved & retained.