Copying text from .pdf across two pages

prob · March 1, 2023, 4:13am

I have an OCR-ed .pdf in DT. If I copy text that begins on one page and ends on the next page, and then paste that text into a document in any other app, I get an odd result: the text from the first page, then an image of the 2nd page, and then the text from the second page. I can delete the image of the second page and bring the two sections of text together, but I’m wondering if there’s some way to copy text that spans two pages WITHOUT it copying the image of the 2nd page.

Can this be done?

BLUEFROG · March 1, 2023, 5:41am

Development would have to assess this.

Pasting into a rich text document in TextEdit doesn’t paste an image of the second page.
There is some artifact when pasting into a document in Pages and Word, though it’s not an image of the entire second page.
Mellel shows no issue and actually seems to give the best results when pasting.

“and then paste that text into a document in any other app”

PS: Please be precise and clear in your descriptions. “Any other app”, for example, gives no indication of which app you’re referring to and, as shown above, TextEdit didn’t show an image of the second page when pasting.

prob · March 1, 2023, 6:33am

Hi Jim,

Your speedy reply made me take another look and I discovered this: it happens only when I’m copying and pasting text from a .pdf I have created with a scanner.

It doesn’t happen with a .pdf I’ve downloaded, or a .pdf I’ve created from a website using the Save as PDF or Export as PDF command. And it doesn’t happen if I use Clip to DEVONthink as a paginated PDF. In all those examples, I can copy and paste a selection that starts on one page and finishes on another page with no problems.

Here’s what it looks like when I select text from two pages of the .pdf I’d downloaded of “Take Control of DEVONthink”:

It works fine.

But if I print those two pages out and then scan them with my ScanSnap iX500, then select text that spans two pages, this is what it looks like:

And if I paste that into Pages, you can see the image of the full 2nd page appears between the selected text from the 1st page and the selected text from the 2nd page.

So I guess my question is: is there any way to prevent this from happening from .pdfs I create with my scanner?

And thanks, as always, for your help.

BLUEFROG · March 1, 2023, 6:41am

I’ll have to test this in the morning. It’s a quarter to 2 and I need to sleep a little.

And you’re welcome

prob · March 1, 2023, 6:52am

Since when do you sleep?

rfog · March 1, 2023, 9:26am

He never sleeps.

A scanned PDF with OCR should have two layers: the graphic layer and the text layer. Normally the text layer is UNDER the graphic layer, so you see the original scanned image while the selection tool uses the lower layer to select text.

Then, when your selection crosses onto the next page, because the text layer of the new page is UNDER that page’s graphic image, you select the image plus the text (you can see it in one of your captures), and copy both.

And this is another of the zillion bugs of the macOS PDF framework, because the same file selects correctly in PDF Expert on the Mac and in Edge Chromium on Windows: their PDF engines understand that you are selecting only the text layer and do not select the graphic layer.

Another way to produce a scanned PDF is to set text over image. This clears the detected text from the image and puts the OCRed text on top… with all the badly OCRed letters showing as garbage. To choose this option you must be reasonably sure the text is completely valid. And there is a third option: text only, which gives you a PDF similar to one generated by Word or other programs capable of creating PDFs, with the same issue of badly OCRed text.
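
For anyone who would rather script around this than delete the image by hand, here is a minimal sketch in Python of reading only the text layer of two adjacent pages, which skips the graphic layer entirely. It assumes the third-party pypdf package is installed, and the function names are made up for illustration. It pulls the text of whole pages rather than a selection, so you would still trim the result to the passage you want. The hyphen handling is a rough heuristic, and nothing here is how DEVONthink or the macOS framework does it.

```python
import re


def join_across_page_break(first_page_text: str, second_page_text: str) -> str:
    """Join the text that ends one page to the text that starts the next."""
    head = first_page_text.rstrip()
    tail = second_page_text.lstrip()
    if not head or not tail:
        return head + tail
    # Heuristic: a word split by a hyphen at the page break ("imple-" + "mentation")
    # is glued back together. Genuine hyphenated compounds would be glued too.
    if re.search(r"[A-Za-z]-$", head) and tail[0].islower():
        return head[:-1] + tail
    return head + " " + tail


def extract_text_only(pdf_path: str, first_page: int, last_page: int) -> str:
    """Return the text layer of pages first_page..last_page (0-based, inclusive).

    Assumption: pypdf is installed (pip install pypdf). It is imported here,
    not at module level, so the helper above works without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    result = ""
    for index in range(first_page, last_page + 1):
        # extract_text() reads text only, so page images are never included.
        page_text = reader.pages[index].extract_text() or ""
        result = join_across_page_break(result, page_text)
    return result
```

Calling `extract_text_only("scan.pdf", 0, 1)` (the file name is a placeholder) would give the text of the first two pages as one plain string with no image in between.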

prob · March 1, 2023, 7:16pm

Thanks for the very informative reply. I get it, and will probably just stay with the status quo and delete the image when it pastes in.

On another subject, you’re so not wrong to hate OneNote. What were they thinking?

rfog · March 2, 2023, 9:15am

You’re welcome.

And yes, OneNote is a great idea but, as happens with all Microsoft implementations, it is the most poorly implemented one. Infinite canvas, three-dimensional document structure, collaboration… but it’s a bottomless pit where you throw everything in and cannot recover anything. And that is intentional, to force you to use it forever.

If at least it worked fine… but I had a lot of sync issues (more on macOS) and crashes, and each time I opened a web capture printout it lost image quality, until it was a graphic with tiny worms instead of letters.

And then comes the search: Microsoft’s search engine searches but does not find anything, even if the text is open on the same screen you are searching from. And I don’t mean searching OCRed text, I mean searching normal text.

I had been using OneDrive since its first incarnation, but I had so many issues with it that I started searching for alternatives and, after testing a lot of them (even some open source ones; I remember some fights with some KDE developers until they invited me to close the door from outside (*)), I ended up with DT/DTTG, and here I have been for some years. And as it is said in Spanish: nobody will move us from Chanquete’s boat.

(*) And after some serious discussions about usability, crappy interfaces, and bloated screens, they implemented all my ideas in Okular… after I was sent to hell!!!

chrillek · March 2, 2023, 9:37am

That’ll certainly get better with their implementation of ChatGPT