PDF+Text vs. PDF-Document

I have already searched the forum but found only older posts, so I am hoping for a brief clarification here.
When I send a document to DT directly from my ScanSnap on the MacBook, it automatically creates a PDF+Text document. If I scan with my mobile phone and put the file into the inbox (unfortunately I don’t know of any way to automate that either), I “only” get a PDF document.

The PDF document in my example is 1.4 MB in size. The PDF+text, on the other hand, is only 80 KB in size.
The quality of the PDF+Text is of course significantly worse, as the size alone suggests. But what other differences are there?
I can mark the text in the PDF in both cases; so the DT search works perfectly.

The only difference besides the quality is that in the bar on the right, the words are not counted for “PDF document”, but they are for “PDF+text”.

So where is the real difference?
Unfortunately, I did not find anything in the manual either.

What do you use the most - and why?

Answers in German are welcome as well :slight_smile:

PDF+Text has a text layer which was indexed (and therefore the full text search & concordance can use the indexed text), PDF document doesn’t (e.g. if the PDF contains only graphics/images).

Wow, I love this forum and how quickly there are always answers here, thank you!

If that’s not a full text search on the PDF document, what is it? Because I can still select the text and also search normally. When I type a word in the search at the top of DT, it also finds results in PDF documents (so I’m not talking about PDF+text).
Or does it have to search again each time and PDF+Text has an index that speeds up the search?

Sorry for my dumb questions.
I would have expected a PDF+Text to take up more storage than a PDF document.
Or is compression always applied when converting to PDF+Text?

The only difference is really whether DEVONthink indexed the text layer or not. This doesn’t have any impact on file sizes and selecting text might still be possible due to the live text feature of recent macOS releases. Everything else is impossible to tell without a copy of the document.

Okay, thanks. But when I convert a PDF document, I just right-click → Convert to → PDF+Text.
After that I get a duplicate document (PDF+Text), and its file size is smaller than that of the PDF document:

2023-02-24_e.on (PDF-Document).pdf (3.0 MB)
2023-02-24_e.on (PDF+Text).pdf (1007.9 KB)

An OCR’d PDF is never going to be the same size as the original. In some cases, it can be larger; sometimes smaller. Also compression can be en/disabled in Preferences > OCR.

And enabling Original Document: Move To Trash in the same preferences will put the unOCR’d original in the database’s Trash.

Thanks again :slight_smile:
Compression is disabled, which is why I’m puzzled by the reduced size of the PDF+Text.

I wasn’t aware that Original Document: Move To Trash also applies to individually converted documents; I thought it only applied to incoming/imported documents.
I’ll activate it, thanks :slight_smile:

You’re welcome :slight_smile:
Each page is processed and saved individually then collated back into a finished file.

PS: A smaller (or larger) file isn’t indicative of its quality.

I have a few “PDF documents” that for some inexplicable reason are not being recognized as PDF+Text in DT.

For what reason might an indexed PDF file with text (not OCR but rather genuine digital-native text) show in DT as a “PDF document” instead of “PDF+Text,” and is there a way to make DT recognize that there’s text, without reimporting/converting it into a new item?

What’s the problem with converting or reimporting the document?

It’s an indexed file for a reason. These particular files are organized automatically in the bowels of another application. It’s great that DEVONthink has the index capacity so that I can use both systems simultaneously. But it will be a headache to deal with duplicates in the other file manager that I was hoping to avoid. Manageable, but not ideal.

What other “file manager”, noting “file management” is not DEVONthink’s core purpose? And why would you have to deal with duplicates?

Neither of them is a file manager, but they both manage files. I’m talking about Zotero. The problem with converting and reimporting the document is that it creates a different document rather than modifying the existing one.

It’s not that difficult for me to reimport one or two files and reset the file organization manually. But I’m looking at more than just a few files with this issue.

Probably because it does not have a text layer. Text that you see in a PDF can, e.g., simply be a TIFF or JPEG image, or a sequence of move and stroke commands that create the impression of letters. None of these things is “text”.
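
For anyone who wants to check this outside DEVONthink, here is a rough sketch in Python, not how DEVONthink itself decides. It assumes the PDF is unencrypted and that its content streams are either uncompressed or Flate-compressed, and it only looks for the standard text-showing operators (`Tj`, `TJ`, `'`, `"`) inside `BT ... ET` blocks. The function name `has_text_operators` is made up for this example.

```python
import re
import zlib

# Raw "stream ... endstream" bodies; the lookbehind skips the word "endstream".
STREAM = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
# Text objects are bracketed by the BT and ET operators.
TEXT_OBJECT = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A string or array operand followed by a text-showing operator.
TEXT_SHOW = re.compile(rb"[)\]>]\s*(?:Tj|TJ|'|\")")


def has_text_operators(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream shows text inside a BT ... ET block."""
    for match in STREAM.finditer(pdf_bytes):
        raw = match.group(1)
        candidates = [raw]
        try:
            # Most content streams use FlateDecode, i.e. zlib compression.
            candidates.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            pass  # not zlib data, or a filter this sketch does not handle
        for data in candidates:
            for block in TEXT_OBJECT.findall(data):
                if TEXT_SHOW.search(block):
                    return True
    return False
```

You would call it with the raw bytes of the file, e.g. `has_text_operators(open(path, "rb").read())`. A `False` result fits the image-only or stroked-outline case described above. A `True` result only means that text operators exist; the text may still not be extractable, for instance if the fonts carry no usable character mapping, and a binary image stream could in rare cases trigger a false positive.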

Not these. Every other PDF reader app can recognize the text in these and let me interact with it.

This might be just “live text” on the latest macOS versions, meaning that there’s indeed no text layer.

That’s not the case here. I’m 100% confident that the document contains text, and it’s not merely the result of Live Text. I’ve used other programs to extract/read the embedded text from the document as plain text. Even DT lets me interact with the PDF’s embedded text in some respects, but not fully. I can use my cursor to highlight text, but I can’t copy it or search within it.

Can you post the doc here for others to look at?

Sounds like either a permission issue or indeed live text. Hard to tell without the document.

Debord 2021 - Society of the Spectacle.pdf (4.0 MB)

Here’s one such example, attached.