How to extract text from the first page of a PDF?

ngan · February 19, 2020, 5:04pm

I'm refining my scripts in my spare time. I am only aware of the command that extracts the full text of a PDF. How can I extract only the text on the first page? I understand that a workaround is to take the first n paragraphs of the rich text content, but getting the text of the first page is my goal…

Thank you in advance.

	tell application id "DNtp"
		set {a} to item 1 of {selection} -- first selected record
		set b to (rich text of a) -- full text of the record, all pages
	end tell

cgrunenberg · February 20, 2020, 8:12am

DEVONthink doesn’t support scripting of PDF pages, but a third-party tool might be able to do this.

ngan · February 20, 2020, 9:06am

Thanks. It’s OK, I’ll extract the first few hundred words or the first n paragraphs of the PDF’s text content as a proxy; that’s good enough for me.
I am reluctant to use too many different tools for any one task. I think DT3 + Better Touch Tool + Text Expander already give me almost everything I need, given that all my tasks are within DT3. The rest is just adjusting the workflow and finding workarounds with AppleScript.
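
For anyone who ends up handling the extracted text outside DEVONthink, the "first n paragraphs or first n words" proxy could look roughly like this in Python. This is only a sketch working on a plain string; `first_paragraphs` and `first_words` are made-up helper names, not DEVONthink commands, and it assumes each paragraph sits on its own line.

```python
def first_paragraphs(text: str, n: int) -> str:
    """Return the first n non-empty lines of text, joined by newlines."""
    # Assumption: one paragraph per line; blank lines are skipped.
    paragraphs = [line for line in text.splitlines() if line.strip()]
    return "\n".join(paragraphs[:n])


def first_words(text: str, n: int) -> str:
    """Return the first n whitespace-separated words, joined by single spaces."""
    return " ".join(text.split()[:n])


if __name__ == "__main__":
    sample = "Title\n\nAuthor line\nAbstract text here\nMore"
    print(first_paragraphs(sample, 2))
    print(first_words(sample, 3))
```

Neither helper knows where the page break is, so this stays a proxy for the first page rather than the real thing.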

kewms · March 27, 2020, 12:23am

Related to this, is it possible to use a content-driven query to set the boundaries of the extracted text? For instance, if I want to extract the text that lies between “Abstract” and “Introduction”?

(Please be gentle. My scripting skills are rudimentary at best.)

cgrunenberg · March 27, 2020, 7:09am

The next release will actually support this via smart rules:

This example would rename the item using the extracted text.

kewms · March 27, 2020, 4:09pm

Oooh, nice!

What I actually want to do is extract the Abstract (and title) out to a separate document, with an eye toward manipulating a folder full of abstracts with other tools.

Katherine

BLUEFROG · March 27, 2020, 4:27pm

That would likely require scripting to accomplish.
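
As a rough idea of what such a script might do once the text is available as a plain string, here is a minimal Python sketch. It is not a DEVONthink feature: `extract_between` and `first_nonempty_line` are hypothetical helpers, and the sketch assumes that “Abstract” and “Introduction” each start a line (the latter possibly preceded by a section number) and that the title is the first non-empty line.

```python
import re


def extract_between(text, start="Abstract", end="Introduction"):
    """Return the text between a line starting with `start` and the next
    line starting with `end`, or None if the markers are not found."""
    pattern = re.compile(
        rf"^\s*{re.escape(start)}\b[:.]?\s*(.*?)^\s*(?:\d+\.?\s*)?{re.escape(end)}\b",
        re.MULTILINE | re.DOTALL | re.IGNORECASE,
    )
    match = pattern.search(text)
    return match.group(1).strip() if match else None


def first_nonempty_line(text):
    """Return the first non-empty line (assumed here to be the title)."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return None


if __name__ == "__main__":
    doc = (
        "A Study of Things\nJane Doe\n\n"
        "Abstract\nWe study things.\nThey are fine.\n\n"
        "1. Introduction\nThings exist."
    )
    print(first_nonempty_line(doc))
    print(extract_between(doc))
```

If a line inside the abstract itself begins with the end marker, the match stops early, so the markers may need tightening for a given document layout.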

kewms · March 27, 2020, 9:12pm

I suspected as much, hence my original question.

Katherine

BLUEFROG · March 27, 2020, 9:13pm

Can you post a few screen captures of documents you’d be processing?

kewms · March 28, 2020, 12:36am

Sure. At least for the immediate need, they’re all from the same conference and follow the same format. These are recent, so they all come with a text layer. (OCR not needed.)

Katherine