Summarize Highlights from PDF creates a lot of unreadable characters

I’m sure this is a PDF problem, but it’s in DT that I see it, so I need to start asking here. When I summarize a PDF with highlights, I often get unreadable characters in the Markdown/rich text files.

Example: the summary output contains strings of unreadable characters where the source PDF had ordinary text (screenshots omitted).

I can share the Markdown file; however, the three PDFs that suffer from the problem are all copyrighted and can’t be shared. (I’m sure I could send them to DT staff, especially since one example is from another DT user, David of Oxford Review.)

Why is it that these characters can’t be read? (A text-recognition error?) What do I need to tell the people who create these files? I ask because this reduces the utility of PDFs and highlighting.

This is a PDF problem, but DT gives you the tools to fix it. The OCR (the magic bit that renders the text layer in PDFs and other files) is wrong, so you’re getting missing characters in your text. DEVONthink Pro has OCR and can redo this for you. I’m not at my Mac right now, but if you right-click on the affected file, you should see an option for OCR in the contextual menu, and I think it opens a submenu that has “convert to new OCR” or something like that.

Clicking this makes a new PDF with a new text layer and the problem should be solved.

I don’t know what happens to the highlights you’ve already made, though, or whether they are also copied to the new file.

I now check that the OCR is correct when I import older PDF files so I don’t get caught out (modern ones seem to fare better).

Edited to add: I can’t screengrab the menu because I’m not at my Mac, but the Take Control book gives the same instruction (excerpt omitted).

@MsLogica Thanks. Data → OCR works.

Wow, that is weird. I’m hoping that either @BLUEFROG or @cgrunenberg can comment. Many of the PDFs we’re discussing were run through Data → Convert → PDF (Paginated) after this discussion: Automatically take ReadOnly PDFs -> ReadWrite.

It’s interesting that DT’s Convert leaves behind unreadable characters but Data → OCR gets them right.

That’s a known issue with PDFKit, which can sometimes corrupt text layers.

Yes, but why does Data → OCR get them right?

OCR creates a completely new text layer.

So by that logic, should I never use Data → Convert → PDF (Paginated)?

Further, should I write a rule that applies Data → OCR to all PDFs on import?

See above. PDFKit might corrupt the text layer, but it usually doesn’t.

No, that’s unnecessary.

I’m sorry, I must be a bit slow/confused this morning. I currently have nine papers with annotations that I’m trying to summarize; five of those nine display these errors in the summary.

One batch of these papers is the Oxford Review set discussed in Automatically take ReadOnly PDFs -> ReadWrite; however, it’s not the only source of these errors.

Since this renders the annotations/highlighting next to useless (for me), I’m looking for a way to avoid it automatically in the future.

Is there a clever way on import to discover which documents are broken and OCR them?

Open a support ticket, and ZIP and attach a problematic file for us to inspect.

Since Apple’s PDFKit is reported to break text layers, I always use a smart rule to back up the original PDFs on import, and I try to use another PDF editor, such as PDF Expert, to highlight or, in your case, split the pages. I also have a small rule that backs up, every hour, any document I have modified in the past hour.

I’m awaiting the forensic analysis from @BLUEFROG (not expected over the weekend). Once he helps me understand, I can decide what to do.

Ideas:

  • Rescan all PDFs using OCR on import and delete the originals?
  • Use the built-in versioning smart rule to version on changes?

It largely depends on where the problem lies. Is it in scanning on import? Is it the use of the Preview app? …?

This is typical behavior when annotating protected PDF files or otherwise changing the content (removing/adding/merging pages). By protected files I mean in particular PDF/A files or PDF files with a digital signature… In most cases you can use the attached Automator workflow to convert a protected PDF file to a “normal” PDF without destroying the text layer. In some cases you have to use OCR…

PDF Flatten (single input).workflow.zip (513.7 KB)
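
If you’d rather script it than use Automator, here is a minimal sketch of the same “rebuild into a plain PDF” idea, assuming the third-party pypdf library (my choice, not part of the workflow above). It simply copies the pages into a fresh file, so it won’t rescue every protected or signed PDF, and the copy no longer carries the original’s PDF/A metadata or a valid signature.

```python
#!/usr/bin/env python3
"""Rough sketch: rebuild a protected PDF into a plain copy (assumes pypdf)."""
import sys

from pypdf import PdfReader, PdfWriter  # assumption: pip install pypdf


def rebuild(src: str, dst: str) -> None:
    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        # Copy each page into a brand-new document; the rebuilt file keeps the
        # existing text layer but drops the PDF/A declaration and any document
        # signature, which is usually what blocks annotation.
        writer.add_page(page)
    with open(dst, "wb") as fh:
        writer.write(fh)


if __name__ == "__main__":
    # usage: rebuild.py protected.pdf plain.pdf
    rebuild(sys.argv[1], sys.argv[2])
```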

I don’t wish to sound ungrateful, but I feel stuck now with respect to effective use of DT. I realize that this isn’t a problem DT created, but it takes out a large chunk of how I use DT.

I use DT for search, reading, and highlighting. I can’t afford to import a paper now, tag and file it, only to discover months later that it had a corrupt text layer. The partial solution I was offered via support was to OCR the file. Great, it worked, but that solution creates a new file in the inbox. I lose tags, author, and title changes; I even lose where it was filed. So now I fear importing new files into DT, since I might have to rescan them with OCR and lose all the added value I create via tags, etc.

What is the best solution?

The only one I have so far: OCR all new PDFs on the way into DT? That doesn’t solve the 1,000+ PDFs that are already in DT, tagged, filed, etc.

What are you using to OCR? From a quick test, when I select OCR > To Searchable PDF from the right-click menu, it carries the title, tags, author, and other metadata over to the new file. It does not copy the linked annotation to the new document, though that is something you’d want, in my opinion.

A better suggestion would be to make a simple smart rule for your group and select “OCR” as the action. It will then simply OCR the documents ‘in place’, keep everything already there intact, and replace your document with the OCRed version. Also, it will maintain the UUID, so any links already made to that item will stay intact.

When I import PDFs, I check that the OCR layer is OK. If it’s not, I OCR in DT before I do anything else with the PDF. That way I know everything in my database has good OCR. (For your existing files, you can make a smart group to find any files with bad OCR.)
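
If you want to do that check in bulk outside DEVONthink, here is a rough sketch of the kind of heuristic I mean, assuming the pypdf library and an arbitrary 2% threshold (both are my assumptions, not anything built in): it flags PDFs whose extracted text is empty or contains a high share of replacement/control characters.

```python
#!/usr/bin/env python3
"""Rough heuristic: flag PDFs whose text layer looks garbled (assumes pypdf)."""
import sys
from pathlib import Path

from pypdf import PdfReader  # assumption: pip install pypdf

# Arbitrary threshold: more than 2% "suspicious" characters flags the file.
SUSPICIOUS_RATIO = 0.02


def looks_garbled(pdf_path: Path) -> bool:
    reader = PdfReader(pdf_path)
    text = "".join(page.extract_text() or "" for page in reader.pages)
    if not text.strip():
        return True  # no text layer at all -> needs OCR anyway
    # U+FFFD replacement characters and stray control characters are typical
    # symptoms of a corrupt text layer.
    suspicious = sum(
        1 for ch in text if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\n\r\t")
    )
    return suspicious / len(text) > SUSPICIOUS_RATIO


if __name__ == "__main__":
    for arg in sys.argv[1:]:
        path = Path(arg)
        status = "needs OCR?" if looks_garbled(path) else "looks OK"
        print(f"{status}\t{path}")
```

Files it flags could then be tagged (for instance with the BadTextLayer tag mentioned below) so a smart rule re-OCRs them.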

Someone else asked about this recently (unless it was this thread and I’d forgotten), and I wanted to test what happens when you OCR a file that already has highlights in it. Have you tried it? I just did, and it duplicates the highlights, but they’re very slightly offset from the text, so you can’t run the summarise highlights option, as it produces incomplete sentences. :thinking: Bit odd.

I guess OCR adds the text it detected as new PDF content at the place where it detected it. That works more or less OK if the software can figure out a font that is close enough to the one used in the PDF, and if it can determine the font size so that the letters generated in the text layer have the same shape and size as those in the original.

If that is what happens, it is of course not possible to match the text layer exactly to the underlying original in each and every situation. And I can of course be wrong and the OCR software does something else entirely.

Curiously, @BLUEFROG replied to me earlier in the day with a modified smart rule that uses the OCR Apply action. This, of course, does the OCR in place.

I’ve created two versions of this rule, one to handle the case where I tag a file as having a bad text layer:
BadTextLayer_Tag_-_OCR_Potentially_Corrupt_PDF_text

and another (not illustrated here) that fires on import when there is a PDF.

Note that you can drag and drop files onto a smart rule to apply the rule’s actions to them, so you’d really only need the one rule.