I’m sorry, I must be a bit slow/confused this morning. I currently have 9 papers with annotations that I’m seeking to summarize; 5 of those 9 display these errors in the summary.
Since Apple’s PDFKit is reported to break text layers, I always use a smart rule to back up the original PDFs on import, and I try to use another PDF editor, such as PDF Expert, to highlight or, in your case, split the pages. A small rule also backs up any document I have modified in the past hour.
This is typical behavior when annotating protected PDF files or otherwise changing the content (removing/adding/merging pages). By protected files I mean in particular PDF/A or PDF files with a digital signature… In most cases you can use the attached Automator Workflow to convert a protected PDF file to “normal” PDF without destroying the text layer. In some cases you have to use OCR …
I don’t wish to sound ungrateful, but I feel stuck now with respect to effective use of DT. I realize that this isn’t a DT created problem but it takes out a large chunk of how I use DT.
I use DT for search, reading and highlighting. I can’t afford to import a paper now, tag and file it, only to discover months later that it had a corrupt text layer. The partial solution I was offered via support was to OCR the file. Great, it worked, but the solution creates a new file in the inbox. I lose tags, author, title changes; I even lose where the file was filed. So now I fear importing new files into DT, since I might have to rescan them with OCR, losing all the added value I create via tags etc.
What is the best solution?
The only one I have so far: OCR all new PDFs on the way into DT? That doesn’t solve the 1000+ PDFs that are already in DT, tagged, filed etc.
What are you using to OCR? From a quick test, when I select OCR > To Searchable PDF from the right-click menu, it carries the title, tags, author and other metadata over to the new file. It does not copy the linked annotation to the new document, but that is what you’d want, in my opinion.
A better suggestion would be to make a simple smart rule for your group and select “OCR” as the action. It will then simply OCR the documents in place, replacing each document with the OCRed version while keeping everything else intact. Also: it maintains the UUID, so any links already made to that item will stay intact.
When I import PDFs, I check that the OCR layer is OK. If it’s not, I OCR in DT before doing anything else with the PDF. That way I know everything in my database has good OCR. (For your existing files, you can set up a smart group to find any files with bad OCR.)
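If you’d rather check outside DT, a rough heuristic is to look for text-showing operators (`Tj`/`TJ`) in a PDF’s content streams; no operators usually means no text layer. This is a minimal sketch, not how DT itself checks: the function name `has_text_layer` is my own, and real PDFs may use filters other than Flate, so it can produce false negatives.

```python
import re
import zlib

def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic text-layer check (illustrative only, not DT's method).

    Scans each stream...endstream body for the PDF text-showing
    operators Tj and TJ. Content streams are usually Flate-compressed,
    so we try to inflate each one first; if that fails, we scan the
    raw bytes. PDFs using other filters (LZW, etc.) are not handled.
    """
    # ") Tj", "] TJ" or "> Tj" mark the end of a shown string/array
    text_ops = re.compile(rb"\)\s*Tj|\]\s*TJ|>\s*Tj")
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.S):
        data = m.group(1)
        try:
            data = zlib.decompress(data)
        except zlib.error:
            pass  # stream was not Flate-compressed; scan raw bytes
        if text_ops.search(data):
            return True
    return False
```

You could point this at a folder of PDFs (reading each file as bytes) to flag candidates for re-OCR, but DT’s own word count in the Info inspector is a simpler sanity check for a single file.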
Someone else asked about this recently (unless it was this thread and I’d forgotten), and I wanted to test what happens when you OCR a file that already has highlights in it. Have you tried it? I just did, and it duplicates the highlights, but they’re very slightly offset from the text, so you can’t run the Summarize Highlights option, as it produces incomplete sentences. Bit odd.
I guess OCRing adds the detected text as new PDF content at the position where it was detected. That works more or less OK if the software can pick a font close enough to the one used in the PDF, and if it can determine the font size so that the letters generated in the text layer have the same shape and size as those in the original.
If that is what happens, it is of course not possible to match the text layer exactly to the underlying original in every situation. And I could of course be wrong, and the OCR software does something else entirely.
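For what it’s worth, OCR engines commonly write the recognized text as an invisible overlay: the PDF spec’s text rendering mode 3 (“neither fill nor stroke”) makes glyphs selectable and searchable without painting anything. A minimal sketch of such a content-stream fragment (the helper name and the assumption that the engine works word-by-word are mine):

```python
def invisible_text_overlay(word: str, x: float, y: float, size: float) -> bytes:
    """Build a PDF content-stream fragment that places `word` invisibly
    at (x, y). Rendering mode 3 hides the glyphs while keeping them
    selectable, which is the usual trick behind a searchable text layer."""
    return (
        b"BT\n"                                   # begin text object
        b"3 Tr\n"                                 # rendering mode 3 = invisible
        + f"/F1 {size:g} Tf\n".encode()           # font + size guessed by the engine
        + f"1 0 0 1 {x:g} {y:g} Tm\n".encode()    # position at the detected box
        + b"(" + word.encode("latin-1") + b") Tj\n"  # show the recognized word
        + b"ET\n"                                 # end text object
    )
```

If the engine’s guessed position or size is even slightly off from the printed glyphs underneath, selections and highlights anchored to the text layer end up offset from the visible text, which would explain the duplicated, slightly shifted highlights described above.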
Last question (I hope) in this sequence: does OCRing the document also take care of the PDFKit bug from the earlier problem, Automatically take ReadOnly PDFs -> ReadWrite - #9 by mlevison, where fonts included in a document triggered a PDFKit bug that caused DT to mark the files as not editable? (If you recall, our previous workaround was Data > Convert > to Paginated PDF.)
If so, this may well turn out to be an elegant, if disk-space-consuming, solution.