Copy/Paste text not displaying certain letters

samrose · November 29, 2020, 6:52pm

I am not sure if this is an issue with DT or my Mac, but when I copy text from a PDF that is open in the DT PDF viewer and then paste it into an annotation document (or any RTF file), some letters are not coming through correctly. For example, effectively is coming out as e"ectively. Do you know of any way to solve this? It has been going on since before the last update.

BLUEFROG · November 29, 2020, 7:02pm

I suggest you convert the PDF to plain text and check the underlying text.

samrose · November 29, 2020, 7:33pm

Thanks for the reply. What would I check for in the underlying text and what would I do to fix the issue for other pdfs? I don’t want to convert all of my pdfs to plain text, and this happens across many files.

BLUEFROG · November 29, 2020, 7:48pm

You said it pasted e"ectively, so I’d suggest looking for that.

pete31 · November 29, 2020, 7:48pm

This script copies the plain text of one selected record to the clipboard:

-- Copy the plain text of the single selected record to the clipboard

tell application id "DNtp"
	try
		set theRecords to selected records
		if theRecords = {} or (count theRecords) > 1 then error "Please select one record."
		
		set theRecord to item 1 of theRecords
		set theText to plain text of theRecord
		set the clipboard to theText
		display notification "Copied"
		return theText
		
	on error error_message number error_number
		-- Error -128 means the user cancelled, so stay silent in that case
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		return
	end try
end tell

It’s been a while since I read it but maybe you’ll find good info in this thread.

BLUEFROG · November 29, 2020, 7:50pm

There is no need for a script in this situation.
Data > Convert > to Plain Text and checking the generated file is sufficient.

samrose · November 29, 2020, 7:59pm

Yes, when I convert to plain text, the errors are in the plain text document. For example ‘di#cult’, ‘de!nition’, ‘di"erent’, ‘$ood’, ‘!eld’ (field). I find it odd that it is replacing the f with different symbols, and in some cases, replacing ‘fi’ (in field). Is this just a problem with certain PDFs? I don’t know much about plain text conversion.

I now understand why it is coming up recently, as I recently started using the paste to plain text command as my default paste option. It generally works well, so I didn’t connect it.

Blanc · November 29, 2020, 8:11pm

Basically, that means the text layer of your PDF contains these errors; that layer is where the text comes from when you copy and paste. Typically, either the PDF itself (e.g., compression, resolution, text size) or the OCR engine is of inferior quality, so letters are not recognised correctly. If the OCR engine was the problem, the only solution is to re-OCR. If the PDF itself is of low quality, there is basically no solution.

I have seen cases of intentional errors in text layers, intended presumably as a cheap watermark or copy protection. Performing OCR can help in these cases too.

BLUEFROG · November 29, 2020, 8:46pm

Yes, the quality of the original would certainly affect the resulting OCR output.

I’m curious: What primary and secondary languages are you using in Preferences > OCR?

Hankk2 · November 30, 2020, 2:19pm

You mention the issue is occurring with combinations involving f: fi, ff, ffi, etc. In many fonts, character combinations such as these are special cases. The combination ‘ff’ is not drawn as just two ‘f’ characters, but as a ‘ligature’ or ‘glyph,’ which is a single character that has two f’s inside of it. (If you look closely at how books are printed, you’ll see this: f’s are big and curvy, so they often collide with other characters, especially in these combinations. ß and æ are related examples, where one character in the font can represent multiple characters of text.)

So, the issue is not one of resolution of the original, but one of font encoding by the program that made the original PDF. The ligatures are set properly to display to the screen, but the text version of them in the PDF was not set properly. You may be able to fix that by re-OCR-ing it, or (if you have access) recreating the PDF.
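
If re-OCR-ing or recreating the PDF is not an option, the pasted text can be patched afterwards. The following is only a minimal Python sketch, not a DEVONthink feature. The symbol-to-ligature table is an assumption inferred from the examples quoted earlier in this thread (e"ectively, de!nition, di#cult, $ood) and will differ for PDFs with a different font encoding. The name `repair_ligatures` is hypothetical. A symbol is only treated as a broken ligature when a lowercase letter follows it directly, so ordinary punctuation such as "Stop! Now" is left alone; an opening double quote directly before a lowercase word would still be misread, so check the result.

```python
import re
import unicodedata

# Assumed mapping, inferred from the examples in this thread only.
# Other PDFs may use entirely different stray symbols.
LIGATURE_FIXES = {'"': "ff", "!": "fi", "#": "ffi", "$": "fl"}

# One of the stray symbols, immediately followed by a lowercase letter.
_BROKEN = re.compile(r'(["!#$])(?=[a-z])')


def repair_ligatures(text: str) -> str:
    """Replace broken ligature symbols with the letters they stand for."""
    # NFKC also expands real Unicode ligature characters such as U+FB01 to
    # "fi". Note that it normalises other compatibility characters as well
    # (for example, a non-breaking space becomes a plain space).
    text = unicodedata.normalize("NFKC", text)
    return _BROKEN.sub(lambda m: LIGATURE_FIXES[m.group(1)], text)


if __name__ == "__main__":
    print(repair_ligatures('e"ectively'))  # prints: effectively
```

This only repairs text after it has been copied; the text layer inside the PDF stays as it is.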