Exporting index from PDF document to create a name index

I need to generate a name index from a PDF document…
Is the embedded index that I create in my PDF doc useful for this?
Is DT in any way useful for this kind of job?

Additional details (and maybe an example document or screenshots) would be useful; currently it’s unclear what you’re trying to achieve. Thank you.

I am finishing a book and I have it as a PDF. From within Acrobat I generate the so-called Embedded Index, which AFAIK is for searching inside the PDF itself…
My question is: is it possible to use this Embedded Index to generate a Name Index?
Is there any other way to generate a Name Index from a PDF?

That one is easier to answer: No. Although I’m not sure that I know what a “Name Index” is. I suppose it’s a collection of names at the end of the document with references to page numbers etc.

I found this thread, which seems to indicate that the embedded index uses a binary format proprietary to Adobe. Therefore, I wouldn’t get my hopes up about using this index in DT.

And what if it is a txt file read by DT?
Is it possible, for instance, to extract only words that start with a capital letter?

Using a script: yes. But that will not tell you where the words occur in the PDF. Maybe you could explain a bit what you’re trying to achieve?

Correction: extracting all words starting with a capital can be done with grep on the command line. For example

tr ' ' '\n' < theFile | grep '^[A-Z][a-z]'

will output all words contained in theFile that begin with an uppercase letter between A and Z followed by a lowercase letter. Note that this might not give what you want for a lot of languages! But at least you don’t need to script anything, only a bit of shell tool massage.
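
A quick way to sanity-check that pipeline is to run it on a throwaway string. This is only a sketch: the sample sentence is made up, the pattern is anchored with `^` so that only words starting with a capital match, and `LC_ALL=C` is my own assumption to keep `[A-Z]` limited to plain ASCII capitals.

```shell
# Made-up sample text; in practice this would come from theFile.
sample='Guanda met a friend in Milano near the USA border'

# Put each word on its own line, then keep words that start with
# an ASCII capital followed by a lowercase letter.
capitalized=$(printf '%s\n' "$sample" | tr ' ' '\n' | LC_ALL=C grep '^[A-Z][a-z]')

printf '%s\n' "$capitalized"
```

All-caps words such as USA are skipped here because the second character has to be lowercase.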

Thanks chrillek!
But… can this script only show the words beginning with a capital letter, or can it somehow “copy and paste” them?
For instance, if I have

Guanda, 108
guarda, 174, 297
Guardando, 292
guardando, 266

can it show only

Guanda, 108
Guardando, 292

thanks again

Can you please take the time to describe what you have and what you want to achieve in more detail?
Right now, it seems to be a moving target: first you only want capitalized words, now you want lines beginning with a capitalized word. Perhaps.

I’m not eager to think about an ever-changing scenario.

I will try…
I have a draft of a book as PDF.
I was able to extract from that PDF a list of ALL the words it contains, with the pages where they occur (via a Windows program…).
From that list I would like to build a name index (an index of names), keeping only the words that are useful to me, the capitalized words (names and proper nouns), as a first filter on the huge (!) number of single words.

Was I a bit clearer?

First: This has no relation to DT. I’ll suggest a way that should work, in my mind, but deeper discussions should be moved elsewhere.

So let’s suppose you have a text file containing the word list (in fact, words with page numbers, each word/page number on its own line) like /Users/<you>/Desktop/index.txt. Then

grep '^[[:upper:]][[:lower:]]' "/Users/<you>/Desktop/index.txt" > "/Users/<you>/Desktop/names.txt"

in the Terminal should do the trick. Afterwards, all lines in “index.txt” that begin with an uppercase character followed by at least one lowercase character will be in “names.txt”.

This works for all Unicode upper and lower case characters, regardless of script and accentuation (where that makes sense, of course; many languages only have one case).
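
To see what the filter keeps, here is a small self-contained run on the four sample lines quoted earlier in the thread. It is a sketch only: it writes to a temporary directory (created with `mktemp -d`, my choice) rather than the Desktop, and the file names simply mirror index.txt and names.txt from above.

```shell
# Work in a scratch directory so nothing on the Desktop is touched.
workdir=$(mktemp -d)

# Sample word list taken from the example lines in this thread.
printf '%s\n' 'Guanda, 108' 'guarda, 174, 297' 'Guardando, 292' 'guardando, 266' > "$workdir/index.txt"

# Keep lines that start with an uppercase character followed by a lowercase one.
grep '^[[:upper:]][[:lower:]]' "$workdir/index.txt" > "$workdir/names.txt"

# Read the result back and show it.
names=$(cat "$workdir/names.txt")
printf '%s\n' "$names"
```

Note that the input and output files are different; redirecting the output into the same file that grep is reading would empty it before grep gets to see it.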


First of all… THANKS! It works very well…
I apologise for going off topic, but… could this script enrich the WordService app?

That’s not for me to answer. But a trivial one-liner like that should probably just stay what it is: a trivial one-liner.

You can’t customize or extend WordService. But it’s possible to create services via Automator.