PDF Index Generation

rkaplan · May 28, 2020, 12:49pm

When reviewing long PDF files, it would be helpful to have an index of all words, and perhaps of specific phrases I define in advance. Though I could search for these, a prebuilt index would be notably more efficient in many cases.

Ideally I would set up an automated process where any PDF I place in a given location is automatically OCR’d and then indexed.
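
The watch-folder half of this idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the folder location, the `already_seen` set, and the `handler` callback (which would stand in for the OCR and indexing step) are all hypothetical names, not part of any existing tool.

```python
from pathlib import Path


def find_new_pdfs(folder, already_seen):
    """Return PDFs in `folder` whose file names are not in `already_seen`."""
    new_files = []
    for path in sorted(Path(folder).iterdir()):
        is_pdf = path.is_file() and path.suffix.lower() == ".pdf"
        if is_pdf and path.name not in already_seen:
            new_files.append(path)
    return new_files


def process_folder(folder, already_seen, handler):
    """Call the hypothetical `handler(path)` once per new PDF and remember it."""
    processed = []
    for path in find_new_pdfs(folder, already_seen):
        handler(path)  # placeholder for "OCR, then index"
        already_seen.add(path.name)
        processed.append(path.name)
    return processed
```

Calling `process_folder` on a timer (or from whatever trigger the host application offers) would pick up each PDF exactly once, since processed file names are recorded in `already_seen`.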

Has anyone done this?

I found the software below and tried it out. On the plus side, it is extremely fast and does exactly what I am looking for. Its user interface is dated: it cannot be invoked from a context menu based on file extension, and it is not AppleScriptable. However, it can be run from the command line, where it offers a fairly sophisticated set of configuration options.

So I believe it would be possible to automate adding an index to any PDF in DT3 by writing an AppleScript that executes command-line instructions.
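
For illustration, here is roughly what "script executes a command-line tool on a PDF" looks like, sketched in Python rather than AppleScript (AppleScript's `do shell script` command can run a command line in a similar way). The indexer's executable and its options are not specified here, so `command` is a placeholder that the caller supplies; nothing below reflects the real tool's flags.

```python
import subprocess


def run_indexer(command, pdf_path, timeout=300):
    """Run `command` (a list: program plus arguments) with `pdf_path` appended.

    `command` is an assumption supplied by the caller; this sketch does not
    know the real indexing tool's name or options.
    """
    result = subprocess.run(
        [*command, str(pdf_path)],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"indexer failed ({result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces, which are common for documents stored in a database.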

Before I attempt this - has anyone else already done this, so I do not reinvent the wheel? Any thoughts on whether this is the best software to try it with, or are there alternatives with AppleScript support or other superior features?

BLUEFROG · May 28, 2020, 1:04pm

Tools > Inspectors > Concordance has a list of words - though not phrases - in the current document.

zeitlings · May 28, 2020, 1:07pm

Did you consider this built-in feature?

This should satisfy the purpose of an index to some extent, I think (if you select a word, you can skip through the document to each occurrence). Perhaps an addition to this feature would make sense: filter by lexical class, e.g. just list the nouns. The NaturalLanguage SDK should manage that.

rkaplan · May 28, 2020, 1:25pm

I did consider that. A major difference is that with dedicated indexing software I can specify phrases to include in the index. So, for example, if it is a medical record I can index “MRI CSPINE” and “MRI Knee”, which is much more efficient and helpful than an index of these words individually, or than having to create individual searches for these combinations.
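
To make the phrase idea concrete, here is a minimal Python sketch of a phrase index. It assumes the PDF has already been OCR'd and its text extracted into one string per page; `build_phrase_index`, `pages`, and `phrases` are hypothetical names for illustration and are not how the indexing software itself works.

```python
import re
from collections import defaultdict


def build_phrase_index(pages, phrases):
    """Map each phrase to the 1-based page numbers on which it appears.

    Matching ignores case and tolerates any run of whitespace (including
    line breaks) between the words of a phrase.
    """
    patterns = {}
    for phrase in phrases:
        words = [re.escape(word) for word in phrase.split()]
        if not words:
            continue  # skip empty phrases
        patterns[phrase] = re.compile(
            r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE
        )

    index = defaultdict(list)
    for page_number, text in enumerate(pages, start=1):
        for phrase, pattern in patterns.items():
            if pattern.search(text):
                index[phrase].append(page_number)
    return dict(index)
```

With `phrases = ["MRI CSPINE", "MRI Knee"]`, a page containing "MRI of the knee" is not listed under "MRI Knee", which is the difference between a phrase index and an index of the individual words.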

That aside, the appearance of the index is much easier to read for an overview - and again this can be pared down to only specific pertinent phrases or words if desired.

DT3 is incredible software, but I think an app dedicated only to indexing can do a better job. Of course DT3’s strength is also scriptability/customization so I am trying to figure out the best way to leverage that and integrate it with some software optimized to the indexing task.

(I am sure you could add better indexing to DT3 by the way - but it would be so so so much more appreciated if you could instead devote that time to displaying rich text custom metadata fields in list view! Now that is something super useful that I cannot add via scripting.)