PDF conversion bug in 3.8.6: Text layer corrupted

Good afternoon,

since updating to DT3 Pro 3.8.6 I’m seeing a strange bug whenever I convert a webarchive to a PDF. The text is fine before the conversion, but in the converted PDF the text layer seems to be corrupted. It produces only gibberish.

Does anyone else have this issue as well?

I just tried it with DEVONthink 3.8.6 on this web page, and didn’t see an issue.

Perhaps give the URL of the troublesome page for others to try?

It seems to affect all webarchives I captured since the update. My workflow is to first capture a website as a webarchive, then trim it down to what I need, and then convert it from within DT3.

Here’s just one example of many.

And this is the PDF that DT3 created:

Moog Taurus.pdf (328.0 KB)

Sorry, but that URL produced what appears to me to be a good PDF+Text, and text searching works when using DEVONthink to view it, just as it works when using Apple Preview. I also imported your “Moog Taurus.pdf” file into DEVONthink, and it seems to be OK both in DEVONthink’s PDF viewer and in Apple Preview. I know this is not much help. Perhaps your “trimming” has something to do with what you see.

Odd. I tried the file in DT3, Preview and Acrobat DC, and in all three cases I only got gibberish when I copied/pasted text from it. :thinking:

I just downloaded Moog Taurus.pdf, opened it in Preview, copied the paragraph following the heading “A Breed Apart”, and pasted it into TextEdit. This is what I got:

(7*%3A"A67"+*A"$,%'%"'679:'")%")5)#3)G3%"69")9A",)3WC:%+%9$")9)36?7%"&696'A9$,t"H%33"$6")" +%*$)#9"%2$%9$"8o:")?%%"G7$"$,%"=)7*7'"V7'$"'%%&'"$6",)5%"$,)$"#9:%_9)G3%"o'6&%$,#9?o"$,)$" '%$'"#$")F)*$J"=,%"+36'%'$"'679:"8"+)9"?%$"#'"F%,)F'"F%:#+$)G3A"W*6&")"<#9#<66?"G7$"$,%%"#'" '$#33")"'#?9#_+)9$":#q%%9+%J"E9A69%"a,6"%?73)3A"7'%'")9")9)36?7%"'A9$,"W6"G)''":7$#%'"a#33" G%"a%33")a)%"6W"$,%"F6G3%&'"$,)$"+)9"G%"+)7'%:"GA"$,%"F,)'#9?"G%$a%%9"$a6"+36'%3AC$79%:" 6'+#33)$6'J"H,%9"$,%"$a6"a)5%W6*&'")%"#9"F,)'%"$,%"*%'73$#9?"'679:"#'"'$*%9?$,%9%:")9:" G%+6&%'"367:%*J"B695%*'%3A"a,%9"$,%"6'+#33)$6'")%"67$"6W"F,)'%"$,%"'679:"a%)>%9'")9:"$,%" 5637&%":*6F'J"=,#'"+)9"&)>%"$,%"G)''"+69$%9$"6W")"$*)+>"W):%"#9")9:"67$"a#$,"$,%"G%)$#9?"6W"$,%" 6'+#33)$6*'J"=,#'"%q%+$"+)9"6W"+67*'%"G%"#*69%:"67$"GA"7'#9?")"+6&F*%''6*"G7$"a#$,"$,%"=)77'" $,%"F*6G3%&"9%5%*")*#'%'"#9"$,%"_*'$"F3)+%J"=,%"6'+#33)$6*'"G%)$")?)#9'$"69%")96$,%*")9:"?#5%" $,%"#+,"'a#3"$,)$"#'"'6")FF%)3#9?"$6"$,%"%)'"G7$"$,%"G)''"+69$%9$"*%&)#9'"'63#:")9:" +69'#'$%9$J"8"69+%")'>%:")9"%2C<66?"%&F36A%%")G67$"$,#'"'7??%'$#9?"$,)$"F%,)F'"'6&%"W6*&" 6W"+6&F*%''#69"p"#9$%9$#69)3"6*"6$,%a#'%"p"a)'"$)>#9?"F3)+%"#9'#:%"$,%"+#+7#$A`"G7$",%" +)$%?6#+)33A":%9#%:"$,)$"$,#'"a)'"$,%"+)'%J

This is with Monterey 12.6 on an Intel processor.

Cheers,
Andy

Ah … when copying out you see the problem.

From the PDF you uploaded, copying and pasting the first paragraph gives me:

@)+>"#9"$,%"3)$%"Reg!'")369?'#:%")">%AG6):"F3)A%"'7**679:%:"GA"K)&&69:"6*?)9'" B3)5#9%$'"<%336$*69'")9:"<#9#<66?'"A67"

From the PDF I created by converting the web archive, copying and pasting the same first paragraph gives me:

=64:"#/"$<%">6$%"W^b!'c"6>1/;'#2%"6":%?F1602"D>6?%0"'30013/2%2"F?"J6&&1/2"10;6/'c @>6d#/%$'c"7%>>1$01/'"6/2"7#/#711;'c"?13"4

Interestingly, different gibberish.

From the web archive from which the PDF above was created, I see copy/pasted:

Back in the late 1970s, alongside a keyboard player surrounded by Hammond organs, Clavinets, Mellotrons and MiniMoogs, you
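
Lining up the first few characters of the gibberish from the uploaded PDF against this readable text suggests the corruption may be a consistent character-for-character substitution rather than random noise. The Python sketch below only illustrates that observation. The function names are hypothetical, and the sample strings are short fragments of the pasted output in this thread, with straight quotes assumed wherever the forum shows curly ones. It builds a mapping from an aligned pair and uses it to decode another fragment.

```python
def build_mapping(garbled, plain):
    """Return a dict of garbled char -> plain char, or None if the pair
    cannot be explained by one consistent substitution."""
    if len(garbled) != len(plain):
        return None
    mapping = {}
    for g, p in zip(garbled, plain):
        # setdefault stores p the first time g is seen; afterwards it
        # returns the stored value, so a mismatch means inconsistency.
        if mapping.setdefault(g, p) != p:
            return None
    return mapping


def decode(text, mapping, unknown="?"):
    """Decode text with the mapping; unseen characters become `unknown`."""
    return "".join(mapping.get(ch, unknown) for ch in text)


# Fragment copied from the uploaded PDF, and the matching readable text.
garbled = '@)+>"#9"$,%"3)$%'
plain = "Back in the late"

mapping = build_mapping(garbled, plain)
print(mapping is not None)      # is the substitution consistent?
print(decode('$,)$', mapping))  # a fragment from the longer pasted sample
```

If the mapping holds across a whole paragraph, that would point more towards a character-encoding problem stored in the PDF (for example in how an embedded font maps glyphs back to characters) than towards a clipboard fault, although nothing in this thread confirms the cause. The second PDF pastes as different gibberish, which would fit each file carrying its own mapping.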

Some progress, but I frankly have no idea what is causing this. Others with more expertise about PDFs can comment. I believe DEVONthink uses “PDFKit”, an Apple API, but beyond that I have no special knowledge.

macOS 12.6

FYI, if I save your URL page to Markdown, then “convert” that MD file to a paginated PDF, and then copy/paste the same text from DEVONthink’s view of the PDF, I get:

Back in the late 1970s, alongside a keyboard player surrounded by Hammond organs, Clavinets, Mellotrons and MiniMoogs, you

When I use the DEVONthink Clipper to save directly to PDF and then copy the same text from Apple Preview, I get:

=64:"#/"$<%">6$%"W^b!'c"6>1/;'#2%"6":%?F1602"D>6?%0"'30013/2%2"F?"J6&&1/2"10;6/'c @>6d#/%$'c"7%>>1$01/'"6/2"7#/#711;'c"?13

Might it be something in the source HTML? I didn’t try other web pages (other duties call right now).

What browser are you using?

Safari 16.0 on macOS 12.6 (M1 iMac).

What are Preferences > Files > Import > Text encoding and Web > Text encoding set to?

Both to Automatic.

For me “Automatic”.

@schieferk

Hmm - I…

  1. Capture the webarchive.
  2. Data > Convert > To Plain Text. All is well.
  3. Strip extraneous content and save.
  4. Data > Convert > To Plain Text. All is well.
  5. Data > Convert > To PDF (One page).
  6. Data > Convert > To Plain Text. All is well. Concordance also looks good.
  7. Copy from PDF.
  8. Command-N to create a new file from the clipboard. All is well.

My files:
Archive.zip (56.6 KB)

Humm here too …

What if you use system clipboard and/or PopClip to copy paste (rather than Command-N)?

Another difference for me, I was converting (and saving from DEVONthink Clipper) to PDF Paginated.

PopClip capture looks perfect.

Clipping directly to a paginated PDF works as expected here.

Restarted the computer and tried again. The captured webarchive looks fine when I follow the steps you just outlined. When I convert directly to PDF from the webarchive though (even without trimming), I again get only gibberish in the text layer.

EDIT: The text layer is similarly corrupted when I capture directly to PDF, btw.

Quick update: I tried to capture the site directly to PDF with EagleFiler as well, and I get the same result. It seems that for me this is not a DT3 issue, but a general macOS issue.
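
When comparing several capture tools like this, a rough check on the pasted or extracted text can flag a suspect result without reading every sample by eye. The Python sketch below is a heuristic only. The function names and the 0.7 threshold are assumptions for illustration, not something DEVONthink or EagleFiler provides, and it says nothing about where the corruption happens.

```python
def letter_ratio(text):
    """Fraction of non-whitespace characters that are letters or digits."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ch.isalnum() for ch in chars) / len(chars)


def looks_garbled(text, threshold=0.7):
    """Treat text as garbled when too few characters are alphanumeric.

    The threshold is an arbitrary assumption; ordinary English prose
    normally scores well above it.
    """
    return letter_ratio(text) < threshold


# Fragments taken from this thread (straight quotes assumed).
pasted_from_pdf = '@)+>"#9"$,%"3)$%'
pasted_from_webarchive = "Back in the late 1970s, alongside a keyboard player"

print(looks_garbled(pasted_from_pdf))
print(looks_garbled(pasted_from_webarchive))
```

Running the same check on text pasted from each tool’s PDF gives a quick yes/no per capture method.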

But to clarify: the text is displayed just fine when the PDF is rendered on screen. So is it really the “text layer”? I hypothesise that it is related to copying into the system clipboard and/or pasting out of it. Or am I throwing out a red herring?