(Solved) Character set issue

Rainer · October 6, 2020, 3:20pm

… strange issue with the character sets. There is a difference between the representation of the same HTML document on Mac and on iPad (with DTTG). Any ideas on what I need to change on my iPad???

chrillek · October 6, 2020, 3:31pm

It might be helpful to post the URL here. Did you save a bookmark or …? Is the problem occurring only with this URL or with all pages on Faz.de?
As far as I can see, the page sets the character set correctly (utf-8).

If I save a bookmark to faz.de in DTTG on the iPad, it looks ok.

Rainer · October 6, 2020, 5:01pm

… it happens with EVERY page like this. I save the HTML document on my Synology-NAS with UTF8 and then I move it to my download directory on my Mac and import it from there to DEVONthink - result see above.
I also tried downloading from the NAS to the iPad first, importing into DTTG, syncing, and then viewing it on my Mac.

Whichever way I try - the result is ALWAYS the same …

BLUEFROG · October 6, 2020, 5:03pm

iOS ≠ macOS so the technologies involved are not necessarily the same between DEVONthink and DEVONthink To Go.

Rainer · October 6, 2020, 5:05pm

… ah, good to know … Any idea on how I could convert it on my iPad?

BLUEFROG · October 6, 2020, 5:11pm

Convert it to… ?

Rainer · October 6, 2020, 5:16pm

… well, to a clutter-free version ;-)…
But maybe this is an iOS issue rather than a DTTG issue …

BLUEFROG · October 6, 2020, 5:23pm

The clutter-free mechanism requires passing a URL, not a document.

chrillek · October 6, 2020, 7:34pm

How? Download from the browser, curl…? If you look at this file with DTTG, is it still ok?

How do you move it (and why move instead of copy)? If you look at it with DTTG, is it still ok?

What I see above is a correct page in DT on the Mac and one with broken character encoding in DTTG. How did it get from the Mac to DTTG? Sync? Import somehow? How does the page look in Safari on the iPad? If you save the file to the files.app on the iPad, is the character encoding still set to utf-8?
What format is the file in on the Mac in DT - bookmark, markdown/uncluttered, webarchive/uncluttered?

chrillek · October 6, 2020, 7:35pm

I suppose that the OP meant “correct character encoding” when they said clutter-free

Rainer · October 6, 2020, 7:57pm

How? Download from the browser, curl…? If you look at this file with DTTG, is it still ok?
Actually I did some scripting in Node-Red (Docker container): I save the HTML content of the print version of the website in an HTML file. I copied this file from the Docker container to a “normal” NAS directory. From there it goes via “share” to DEVONthink.

How do you move it (and why move instead of copy)? If you look at it with DTTG, is it still ok?
See above

What I see above is a correct page in DT on the Mac and one with broken character encoding in DTTG. How did it get from the Mac to DTTG? Sync? Import somehow?
WebDAV - but even if I send the HTML file via email, without any DEVONthink involved, the problem is always the same!
How does the page look in Safari on the iPad? If you save the file to the files.app on the iPad, is the character encoding still set to utf-8?
What format is the file in on the Mac in DT - bookmark, markdown/uncluttered, webarchive/uncluttered?
Whatever I do with this file on my Mac - looks okay!

Rainer · October 6, 2020, 8:07pm

… if I convert the HTML into a PDF on my Mac - then this PDF is fine on the iOS side too …
… and the problem exists - as far as I can see - mainly with German specific characters like “Ä, Ö, ß, …”

chrillek · October 7, 2020, 5:29am

It is still not clear what you did. What is the “print version” of the HTML file (PDF? Markdown? HTML?) What is its “content”, and why do you jump through all these hoops just to save a simple HTML file? And why do you choose not to answer concrete questions that might help solve the problem?

As to your last post: what you’re seeing in DTTG is utf-8 encoded data interpreted as latin-1. Somewhere in your convoluted process, you drop the character encoding.
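The effect described here (utf-8 bytes read as latin-1) can be reproduced in a few lines. This is only an illustration in Python: the sample word is made up and is not taken from the file discussed in the thread.

```python
# Illustration only: the sample text is an assumption, not the OP's file.
text = "Süddeutsche"

# The file on disk holds UTF-8 bytes; "ü" becomes the two bytes C3 BC.
utf8_bytes = text.encode("utf-8")

# A reader that wrongly assumes Latin-1 turns each byte into one character.
misread = utf8_bytes.decode("latin-1")
print(misread)  # SÃ¼ddeutsche

# As long as nothing was lost along the way, the misreading is reversible.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == text)  # True
```

The two-character sequences starting with "Ã" in place of umlauts are the typical sign that the bytes are fine and only the declared or assumed encoding is wrong.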

In my opinion, this has nothing to do with DTTG. It is perfectly capable of displaying the original HTML, be it as a bookmark or a webarchive. So just do this: save the HTML, and it should work.

Rainer · October 7, 2020, 7:29am

… I should have started by explaining the “project” a bit, and not only the part that does not work …

I want to set up my private news parser which ideally runs without handling from my side:
have a cup of coffee in the morning, take your iPad OR your iMac and have a look at the news.
Therefore I regularly parse some websites on my Synology and grab the interesting pages clutter-free. I don’t want only the link, though; I want to have an HTML or PDF file in my environment, because I realized that in a couple of cases the original content is no longer on the internet.
I created a Node-Red workflow running in a Docker container on my Synology, and the problem mentioned above appears when I try to get this document out of my Docker container into a DEVONthink database…
… Meanwhile I found out that the root cause is most likely the “download mechanism” of my Synology. When I download the HTML files from the Docker container directly to the Mac, the content is “damaged” afterwards (see above).
When I use the Linux file copy, everything is fine.

cgrunenberg · October 7, 2020, 7:37am

Could you please zip the HTML document and send it to cgrunenberg - at - devon-technologies.com? Thanks in advance!

Rainer · October 7, 2020, 7:39am

… Problem solved - If I adapt the character set UTF8/Latin, then everything works as expected!

chrillek · October 7, 2020, 1:23pm

Could you please explain what exactly you did to fix this? The HTML is already in utf-8, so there would be no need to adapt anything.
I don’t mean to be rude, but in my opinion more clarity on the description of problem and solution would greatly speed up and ease the discussion.

Rainer · October 7, 2020, 1:48pm

… you are more than right, chrillek! Sorry for not being clear enough!
… I use Chrome and Safari on different devices, and I have now found that they had different character sets in their respective settings.
I only noticed this now and switched the creation of my HTML to Latin-1. Now everything is fine.
Sorry again for the lack of clarity on my side.

chrillek · October 7, 2020, 1:57pm

That makes no sense at all. Browsers nowadays recognize the character encoding of the HTML. There’s absolutely no need to set a fixed encoding in the browser unless the file is displayed incorrectly.

I’d discourage that. First, the HTML has the correct encoding (no, I’m not getting tired of pointing that out). So, second, there’s no point in changing it at all. Third, downgrading from utf-8 to Latin-1 is asking for trouble, since utf-8 can encode every Unicode character while Latin-1 covers only 256 of them. In other words, you’ll lose all characters that are not part of Latin-1. Given that e.g. Süddeutsche Zeitung uses the correct characters in non-Latin-1 names (whereas Spiegel just “converts” them), you might run into similar issues as before.
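A small Python sketch of the third point, i.e. what happens to a character that Latin-1 does not contain. The name used here is a made-up example, not one taken from the newspapers mentioned, and `fits_latin1` is a hypothetical helper.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character of text exists in Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# German umlauts are part of Latin-1, so this text survives the downgrade.
print(fits_latin1("Süddeutsche"))  # True

# "ř" (U+0159) is not part of Latin-1, so this name does not.
name = "Dvořák"
print(fits_latin1(name))  # False

# Forcing the conversion silently replaces the character with "?".
lossy = name.encode("latin-1", errors="replace")
print(lossy)  # b'Dvo?\xe1k'
```

Once the character has been replaced, the original cannot be recovered from the Latin-1 file.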

Why don’t you just download the HTML with wget/curl or something similar and store that?

Rainer · October 7, 2020, 2:32pm

Given that e.g. Süddeutsche Zeitung uses the correct characters in non-latin-1 names (whereas Spiegel just “converts” them), you might run into similar issues as before.

Good point. I have to keep that in mind.

Why don’t you just download the HTML with wget/curl or something similar and store that?

I try to do my whole workflow in Node-Red. I am just a beginner, but I have not yet seen a node that allows downloading files from the internet. I use the Webdriver.IO nodes for the parsing, and they give you the CONTENT of a website, which you can then put into an HTML file.
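When only the page content is available as a string, one option is to write it out explicitly as utf-8 and make sure the file itself declares that encoding, so a reader does not have to guess. The following is a minimal sketch in Python rather than Node-Red, and whether a missing declaration was the actual cause in this thread is not established. The names `with_charset_declaration` and `save_html` are hypothetical helpers, not part of Node-Red or Webdriver.IO.

```python
import re
from pathlib import Path

# Matches an existing declaration such as <meta charset="utf-8"> or the
# older <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(r"<meta[^>]+charset", re.IGNORECASE)
HEAD_OPEN = re.compile(r"(<head(?:\s[^>]*)?>)", re.IGNORECASE)


def with_charset_declaration(html: str) -> str:
    """Return html with a utf-8 meta declaration, adding one if missing."""
    if META_CHARSET.search(html):
        return html
    # Put the declaration right after the opening <head> tag if there is one.
    patched, count = HEAD_OPEN.subn(r'\1<meta charset="utf-8">', html, count=1)
    if count:
        return patched
    # Assumption: a fragment without <head>; prepend the declaration.
    return '<meta charset="utf-8">' + html


def save_html(html: str, path: Path) -> None:
    """Write the page as utf-8 bytes, independent of the system default."""
    path.write_text(with_charset_declaration(html), encoding="utf-8")


if __name__ == "__main__":
    page = "<html><head><title>Grüße</title></head><body>Ä, Ö, ß</body></html>"
    save_html(page, Path("page.html"))
```

Keeping the output in utf-8 avoids the character loss discussed earlier in the thread, and the explicit `encoding="utf-8"` keeps the bytes identical no matter which machine runs the script.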