I can share the Markdown file; however, the three PDFs I have that suffer from the problem are all copyrighted and can’t be shared. (I’m sure I could send them to DT staff, especially since one example is from another DT user, David of Oxford Review.)
Why is it that these characters can’t be read? (A text recognition error?) What do I need to tell the people who create these files? I ask because this reduces the utility of PDFs and highlighting.
This is a PDF problem, but DT gives you the tools to fix it. The OCR (the magic bit that renders the text layer in PDFs and other files) is wrong, so you’re getting missing characters in your text. DEVONthink Pro has OCR and can redo this for you. I’m not at my Mac right now, but if you right-click on the affected file you should see an option for OCR in the context menu, and I think it opens a submenu that has “convert to new OCR” or something like that.
Clicking this makes a new PDF with a new text layer and the problem should be solved.
I don’t know what happens to the highlights you’ve already done though, whether they are also copied to the new file.
I check that the OCR is correct when I import older pdf files now so I don’t get caught out (modern ones seem to fare better).
Edited to add: I can’t screengrab the menu because I’m not at my Mac, but here is the instruction in the Take Control book:
Since Apple’s PDFKit is reported to break text layers, I always use a smart rule to back up the original PDFs on import, and I try to use another PDF editor like pdf export to highlight or, in your case, split the pages. I also use a small rule to back up, every hour, any document I have modified in the past hour.
This is typical behavior when annotating protected PDF files or otherwise changing the content (removing/adding/merging pages). By protected files I mean in particular PDF/A or PDF files with a digital signature… In most cases you can use the attached Automator Workflow to convert a protected PDF file to “normal” PDF without destroying the text layer. In some cases you have to use OCR …
I don’t wish to sound ungrateful, but I feel stuck now with respect to effective use of DT. I realize that this isn’t a DT created problem but it takes out a large chunk of how I use DT.
I use DT for search, reading and highlighting. I can’t afford to import a paper now, tag and file it, only to discover months later that it had a corrupt text layer. The partial solution I was offered via support was to OCR the file. Great, it worked, but the solution creates a new file in the inbox. I lose tags, author and title changes; I even lose where it was filed. So now I fear importing new files into DT, since I might have to rescan them with OCR and lose all the added value I create via tags etc.
What is the best solution?
The only one I have so far: OCR all new PDFs on the way into DT? That doesn’t solve the 1000+ PDFs that are already in DT, tagged, filed etc.
What are you using to OCR? From a quick test, when I select OCR > To Searchable PDF from the right-click menu, it carries the title, tags, author and other metadata over to the new file. It does not copy the linked annotation to the new document, but that is what you’d want, in my opinion.
A better suggestion would be to make a simple Smart Rule for your group and select “OCR” as the action. It will then simply OCR the documents ‘in place’: your document is replaced with the OCR-ed version and everything already attached to it stays intact. Also, it will maintain the UUID, so any links already made to that item will stay intact.
When I import PDFs, I check that the OCR layer is OK. If it’s not, I OCR in DT before I do anything else with the PDF. That way I know everything in my database has good OCR. (For your existing files you can do a smart group to find any files with bad OCR.)
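For anyone who wants to check files outside DT, here is a minimal Python sketch of what a “bad text layer” test could look like. It is not a DEVONthink feature and it rests on several assumptions: you have already extracted the PDF’s text into a string with some other tool, the function names (`suspicious_char_ratio`, `looks_damaged`) are made up for illustration, and the 5% threshold is an arbitrary starting point. It only counts characters that typically signal a damaged text layer (the Unicode replacement character, control characters, private-use and unassigned code points), so it will not catch OCR that produced plausible-looking but wrong letters.

```python
import unicodedata

# Unicode general categories treated as "suspicious" (an assumption of this sketch):
# Cc = control, Co = private use, Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn"}
REPLACEMENT_CHAR = "\ufffd"  # shown when a character could not be decoded


def suspicious_char_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look like damage.

    An empty (or whitespace-only) string returns 1.0, because a PDF with no
    extractable text has no usable text layer at all.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    bad = sum(
        1
        for c in chars
        if c == REPLACEMENT_CHAR
        or unicodedata.category(c) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(chars)


def looks_damaged(text: str, threshold: float = 0.05) -> bool:
    """Flag text whose suspicious-character ratio exceeds the threshold."""
    return suspicious_char_ratio(text) > threshold
```

You would call `looks_damaged(extracted_text)` on the text of each PDF and OCR only the ones it flags. Treat the result as a hint for which files to open and inspect, not as a verdict.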
Someone else asked about this recently (unless it was this thread and I’d forgotten), and I wanted to test what happens when you OCR a file that already has highlights in it. Have you tried it? I just did, and it duplicates the highlights, but they’re very slightly offset from the text, so you can’t run the summarise highlights option because it produces incomplete sentences. Bit odd.
I guess OCRing adds the text it detected as new PDF content at the place where it detected the text. That works more or less OK if the software can figure out a font that is close enough to the one used in the PDF, and if it is able to determine the font size so that the letters generated in the text layer have the same shape and size as those in the original.
If that is what happens, it is of course not possible to match the text layer exactly to the underlying original in each and every situation. And I can of course be wrong and the OCR software does something else entirely.