Incomplete indexing of large PDFs

bobbob · December 5, 2020, 5:21am

I used DEVONthink Pro 3 (DT) to index a folder containing PDFs of up to 13,000 pages. It appears that DT (like macOS Spotlight) does not index PDFs past roughly 1,000 pages.

Is there a way to force DT to index the entirety of every PDF file regardless of size? (Is there a way to force Spotlight to do so?) If so, how?

Thanks
Dan

BLUEFROG · December 5, 2020, 6:08pm

Where are you getting PDFs that large?

bobbob · December 6, 2020, 8:15pm

Medical records and books, mainly.

I could solve the problem by breaking files up into chunks, but in my circumstances that would create disadvantages I’d like to avoid.

Incidentally, Foxtrot Pro does not have this problem if you switch from relying on Spotlight to index PDFs to using xpdf instead.

rkaplan · December 6, 2020, 8:21pm

Interesting - where is the config setting for xpdf?

bobbob · December 6, 2020, 8:46pm

If you start up Foxtrot using command/option, it’s in the dialogue box that pops up.
See https://help.foxtrot-search.com/600-pdf-importer

BLUEFROG · December 7, 2020, 6:04am

Development would have to assess this.

A link to or an example document would be helpful.

cgrunenberg · December 7, 2020, 9:21am

DEVONthink uses a background task to index certain documents (e.g. PDFs) and a timeout so that, for example, corrupted documents cannot stall the indexing or crash the main app. Therefore the only limits are the timeout and the speed of your computer. Are any of these giant PDFs downloadable/public?
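
To illustrate the general idea of indexing in a separate task guarded by a timeout, here is a minimal Python sketch. It is not DEVONthink's actual implementation: the function name `run_with_timeout` and the stand-in commands `FAST`, `SLOW`, and `CRASH` are hypothetical, and a real indexer would launch a text extractor rather than these toy child processes.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd in a child process and return (status, stdout).

    status is "ok", "failed" (non-zero exit, e.g. a crash on a corrupted
    file), or "timeout" (the child was killed after timeout_s seconds).
    Because the work happens in a child process, a crash or a hang there
    cannot take down or stall the parent.
    """
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return ("timeout", "")
    if done.returncode != 0:
        return ("failed", "")
    return ("ok", done.stdout)


# Hypothetical stand-ins for a text extractor working on three documents.
FAST = [sys.executable, "-c", "print('page text')"]           # small file
SLOW = [sys.executable, "-c", "import time; time.sleep(30)"]  # huge file
CRASH = [sys.executable, "-c", "import sys; sys.exit(3)"]     # corrupted file

if __name__ == "__main__":
    for name, cmd in [("fast", FAST), ("slow", SLOW), ("crash", CRASH)]:
        status, _ = run_with_timeout(cmd, timeout_s=2)
        print(name, status)
```

With a design like this, a very large document on a slow machine can hit the timeout and end up unindexed or only partly indexed even though there is no explicit page limit.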

rkaplan · December 7, 2020, 10:41am

FWIW - I regularly work with PDFs on the order of thousands of pages (also medical records). I have found that the lengthy OCR and other processing of these files can at times slow down the execution of smart rules. It is also not a good idea to attempt other CPU-intensive tasks in DT3 while such large files are being processed. That said, I have never seen a page limit when indexing such files.