PDF+Text vs. PDF-Document

Plip · February 24, 2023, 9:59am

I have already searched the forum but found only older posts, so I’m hoping for a brief clarification here.
When I send a document to DT with my ScanSnap directly on the MacBook, it automatically creates a PDF+Text document. If I scan with my mobile phone and put the file into the inbox (unfortunately I don’t know of any way to automate that either), I “only” get a PDF document.

The PDF document in my example is 1.4 MB in size. The PDF+Text, on the other hand, is only 80 KB.
Judging by the size alone, the quality of the PDF+Text is of course significantly worse. But what else are the differences?
I can select the text in the PDF in both cases, so the DT search works perfectly.

Besides the quality, the only difference I see is that the bar on the right shows a word count for “PDF+Text” but not for “PDF document”.

So where is the real difference?
Unfortunately, I did not find anything in the manual either.

What do you use the most - and why?

Answers in German are welcome as well

cgrunenberg · February 24, 2023, 10:19am

PDF+Text has a text layer which was indexed (and therefore the full text search & concordance can use the indexed text), PDF document doesn’t (e.g. if the PDF contains only graphics/images).
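To illustrate why an indexed text layer matters for search and word counts, here is a minimal sketch of an index built from extracted text. It is not DEVONthink’s implementation; the file names, the sample text and the helper names (`tokenize`, `build_index`, `concordance`) are assumptions made up for the example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each word to the set of document names whose text contains it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def concordance(text):
    """Count how often each word occurs in one document's text."""
    return Counter(tokenize(text))


# Hypothetical sample data: one file with a text layer, one without.
documents = {
    "scan-with-text.pdf": "Invoice for electricity, February 2023",
    "scan-image-only.pdf": "",  # no text layer, so there is nothing to index
}
index = build_index(documents)
```

Looking up `index["invoice"]` returns only the file that had text to index; the image-only file never appears in any lookup and has an empty word count.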

Plip · February 24, 2023, 10:24am

Wow, I love this forum and how quickly there are always answers here, thank you!

If that’s not a full text search on the PDF document, what is it? Because I can still select the text and also search normally. When I type a word in the search at the top of DT, it also finds results in PDF documents (so I’m not talking about PDF+text).
Or does it have to search again each time and PDF+Text has an index that speeds up the search?

Sorry for my dumb questions.
In my mind, PDF+Text would have to be larger in storage volume than a PDF document.
Or is compression always started when converting to PDF+Text?

cgrunenberg · February 24, 2023, 10:36am

The only difference is really whether DEVONthink indexed the text layer or not. This doesn’t have any impact on file sizes, and selecting text might still be possible due to the Live Text feature of recent macOS releases. Everything else is impossible to tell without a copy of the document.

Plip · February 24, 2023, 10:47am

Okay, thanks. But when I convert a PDF document, I just right-click → convert to → PDF+Text.
After that I get a second document (PDF+Text), and its file size is smaller than that of the original PDF document:

2023-02-24_e.on (PDF-Document).pdf (3.0 MB)
2023-02-24_e.on (PDF+Text).pdf (1007.9 KB)

BLUEFROG · February 24, 2023, 10:56am

An OCR’d PDF is never going to be the same size as the original. In some cases it can be larger, in others smaller. Also, compression can be enabled or disabled in Preferences > OCR.

And enabling Original Document: Move To Trash in the same preferences will put the unOCR’d original in the database’s Trash.

Plip · February 24, 2023, 11:02am

Thanks again
The compression is disabled, that’s why I’m wondering about the reduced size of PDF+Text.

I wasn’t aware that Original Document: Move To Trash also applies to individually converted documents; I thought it only applied to incoming/imported documents.
I’ll activate it, thanks

BLUEFROG · February 24, 2023, 11:09am

You’re welcome
Each page is processed and saved individually, then collated back into a finished file.

PS: A smaller (or larger) file isn’t indicative of its quality.

aaaaaaaaaaaaaaaa · March 27, 2024, 5:38am

I have a few “PDF documents” that for some inexplicable reason are not being recognized as PDF+Text in DT.

For what reason might an indexed PDF file with text (not OCR but rather genuine digital-native text) show in DT as a “PDF document” instead of “PDF+Text,” and is there a way to make DT recognize that there’s text, without reimporting/converting it into a new item?

BLUEFROG · March 27, 2024, 5:55am

What’s the problem with converting or reimporting the document?

aaaaaaaaaaaaaaaa · March 27, 2024, 5:58am

It’s an indexed file for a reason. These particular files are organized automatically in the bowels of another application. It’s great that DEVONthink has the index capacity so that I can use both systems simultaneously. But it will be a headache to deal with duplicates in the other file manager that I was hoping to avoid. Manageable, but not ideal.

BLUEFROG · March 27, 2024, 6:08am

What other “file manager”, noting “file management” is not DEVONthink’s core purpose? And why would you have to deal with duplicates?

aaaaaaaaaaaaaaaa · March 27, 2024, 6:22am

Neither of them is a file manager, but they both manage files. I’m talking about Zotero. The problem with converting and reimporting the document is that it creates a different document rather than modifying the existing one.

It’s not that difficult for me to manually reimport one or two files and reset the file organization. But I’m looking at more than just a few files with this issue.

chrillek · March 27, 2024, 7:15am

Probably because it does not have a text layer. Text that you see in a PDF can, e.g., be simply a TIFF or JPEG image, or a sequence of move and stroke commands that creates the impression of letters. None of these things is “text”.
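The distinction can be checked roughly from outside DEVONthink. The following is a minimal sketch, assuming a simple PDF whose content streams are either uncompressed or Flate-compressed: it looks for text-showing operators (`Tj`, `TJ`) inside `BT … ET` text objects. It is a heuristic, not a PDF parser and not how DEVONthink decides; encrypted files, other filters, or text hidden inside form XObjects may be missed. The function names are made up for illustration.

```python
import re
import zlib

# A text object is bracketed by BT ... ET and shows glyphs with Tj or TJ.
TEXT_OBJECT = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)
# Raw bytes between the "stream" and "endstream" keywords of an object.
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def iter_streams(pdf_bytes):
    """Yield each stream body, inflated when it is valid Flate (zlib) data."""
    for match in STREAM.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes after the data.
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            yield raw  # not Flate data (e.g. plain operators or a JPEG)


def looks_like_it_has_text(pdf_bytes):
    """Return True if any stream contains a text-showing operator."""
    return any(TEXT_OBJECT.search(body) for body in iter_streams(pdf_bytes))
```

For a real file, one would pass in `open(path, "rb").read()`. A page that only paints an image (`/Im0 Do`) or only strokes paths yields `False`, which matches the point above: such a page shows letters but contains no text.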

aaaaaaaaaaaaaaaa · March 27, 2024, 7:32am

Not these. Every other PDF reader app can recognize the text in these and let me interact with it.

cgrunenberg · March 27, 2024, 8:20am

This might be just “live text” on the latest macOS versions, meaning that there’s indeed no text layer.

aaaaaaaaaaaaaaaa · March 30, 2024, 5:33am

That’s not the case here. I’m 100% confident that the document contains text, and it’s not merely the result of Live Text. I’ve used other programs to extract/read the embedded text from the document as plain text. Even DT lets me interact with the PDF’s embedded text in some respects, but not fully. I can use my cursor to highlight text, but I can’t copy it or search within it.

rmschne · March 30, 2024, 6:20am

Can you post the doc here for others to look at?

cgrunenberg · March 30, 2024, 8:45am

Sounds like either a permission issue or indeed live text. Hard to tell without the document.
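For the permission possibility, a rough check is sketched below. In the PDF standard security handler, the encryption dictionary carries an integer `/P` whose bit 5 (counting from 1 at the low-order end) governs copying or extracting text and graphics. This sketch only tests whether a file mentions `/Encrypt` at all and interprets a `/P` value that you have read from the file yourself; whether and how DEVONthink honours that flag is an assumption, and the function names are hypothetical.

```python
COPY_BIT = 1 << 4  # bit 5 (1-indexed) of /P: copy or extract text and graphics


def declares_encryption(pdf_bytes):
    """Heuristic: True if the raw file mentions an /Encrypt entry anywhere."""
    return b"/Encrypt" in pdf_bytes


def copy_allowed(p_value):
    """Interpret the /P permission integer from the encryption dictionary."""
    return bool(p_value & COPY_BIT)
```

If `declares_encryption` is `False`, permission flags are not in play and the cause lies elsewhere. If it is `True`, the `/P` value from the encryption dictionary can be passed to `copy_allowed`.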

aaaaaaaaaaaaaaaa · March 30, 2024, 9:14am

Debord 2021 - Society of the Spectacle.pdf (4.0 MB)

Here’s one such example, attached.