Storing HTML source and PDF in a record

FiddleDiddle · February 29, 2024, 2:19pm

So my intention is to store both the PDF and HTML source of a tab into one record.

I prefer to review documents in PDF format, but I want to store the source HTML as well so I can extract metadata from it in the future as required.

My code below seems to work fine, with the exception of the “set source of theRecord to theHTML”.
The documentation seems to imply this should work…

source (text) : The HTML/XML source of a record if available or the record converted to HTML if possible.

		set name of theRecord to do JavaScript ("document.getElementsByTagName('h1')[0].innerHTML") in thisTab
		set theHTML to source of thisTab
		set thePDF to PDF of thisTab
		set data of theRecord to thePDF
		set URL of theRecord to theURL
		set source of theRecord to theHTML

No matter what I do, the source of theRecord is always empty. Any suggestions?

Thanks

Frank

chrillek · February 29, 2024, 2:32pm

How do you create theRecord, i.e. which type do you provide?

source only makes sense with HTML, or if the content can be converted to HTML
data is not defined for HTML records

Looking at your code, you seem to be retrieving the HTML from a document open in a DT window (although you don’t say so, which makes me take an educated guess). But there’s no point in guessing – please describe what you do in detail and provide complete code.

Aside: Setting the name property to an HTML-valued attribute is probably nonsensical – you want text in the name, not arbitrary HTML. For example, the innerHTML of the h1 element on this HTML page is

'<span class="topic-statuses"></span><a class="widget-link topic-link" href="/t/storing-html-source-and-pdf-in-a-record/78863" title="" data-topic-id="78863"><span>Storing HTML source and PDF in a record</span></a><span class="header-topic-title-suffix">      \n\n</span>'

Is that what you want to have as the name of your record?

So, instead of innerHTML, use innerText.
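
To make the difference concrete, here is a rough sketch that runs outside a browser. It takes the innerHTML string quoted above and strips the markup to approximate what a reader sees. The helper approximateText is hypothetical and is not a DOM API; in a real page innerText does this properly, including entities and hidden elements, which this sketch ignores.

```javascript
// The innerHTML of this page's h1, as quoted above.
const innerHTML =
  '<span class="topic-statuses"></span><a class="widget-link topic-link" ' +
  'href="/t/storing-html-source-and-pdf-in-a-record/78863" title="" ' +
  'data-topic-id="78863"><span>Storing HTML source and PDF in a record</span></a>' +
  '<span class="header-topic-title-suffix">      \n\n</span>';

// Hypothetical helper (not a DOM API): drop the tags and collapse whitespace
// to approximate the visible text of the element.
function approximateText(html) {
  return html.replace(/<[^>]*>/g, "").replace(/\s+/g, " ").trim();
}

const recordName = approximateText(innerHTML);
console.log(recordName); // prints "Storing HTML source and PDF in a record"
```

The stripped string is what you would want as a record name; the raw innerHTML is not.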

FiddleDiddle · February 29, 2024, 2:51pm

Thanks for the quick response. I’m creating the record as a “PDF document”.

Based on what you’re saying it seems that storing the HTML in the “source” field in a PDF type record is not an option. So it seems like what I would like to achieve is not going to work. Or I’m going to need to do something a bit odd like storing it in a custom metadata field.

Thanks for the heads up on the innerText. I have to admit innerHTML has been working fine. But I’ll test and switch to innerText if it works.

Thanks for your help.

chrillek · February 29, 2024, 2:55pm

Right, that’s rather odd.

I gave you an example where it doesn’t work fine. It all depends on the inner HTML of the h1. If, for example, it contains a link or other things, the HTML will be different from the text. innerText always does The Right Thing™

BLUEFROG · February 29, 2024, 3:57pm

Why would you be trying to set the source of a PDF file like this? The source isn’t intended to be a place to store data like this.
What you’re doing should be using custom metadata.

chrillek · February 29, 2024, 5:15pm

To store the result of converting PDF to HTML? I don’t even get what would be the point of duplicating information like that.

BLUEFROG · February 29, 2024, 5:17pm

I don’t understand the point of it either, but if there is a reason I’m missing, the data should be stored in custom metadata rather than by trying to set the source of the document.

FiddleDiddle · March 1, 2024, 11:51am

The situation is, I have access to a site that has a history of the work I have done. The current version of the site is due to be shut down in 12 months’ time. The replacement won’t contain the legacy data. The site hosts a page for each of my tasks. Each page includes a large number (80+) of fields of useful information as well as large sections of technical text.

I’m copying this information into DEVONthink. Thanks to @cgrunenberg and @chrillek I have a script that opens each page in turn and saves it as a PDF record. As it saves it, it extracts some fields from the HTML and saves them as custom metadata alongside the PDF. I need to save the PDF for regulatory requirements so that I can demonstrate the content has not been tampered with (yes, I know PDFs can be tampered with, but this is the regulatory standard).

However, as time goes by I’ve realised that there are more of the 80+ fields that I would like to extract from the HTML and add to custom metadata fields of the PDF records. I believe I have 4 options:

1. Attempt to extract the fields from the PDF. I wouldn’t even know where to start with this.
2. Rerun my script now, extract all 80 fields and store them as custom metadata in the PDF record, although I know this is overkill. And due to the lack of "id"s in some of the fields, I could spend a lot of time trying to extract fields that I never need.
3. In the short term I could re-retrieve the source HTML from the site, extract the specific fields I require and update the custom metadata of the existing PDF records. But with literally 10,000+ documents this is going to be a burden on the host, and I would prefer not to do this each time I decide I want to extract more data from the HTML.
4. In the ideal world I would do one more pass of the site, download the page HTML and attach the HTML to the existing PDF records. Then, in the future, if I identify a specific field that I’d benefit from as a custom metadata field, I can write a script to visit all of the records that contain the HTML source, extract the fields I require and add them as custom metadata to the existing PDF record.
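
As a rough illustration of the "extract later" step in the last option, here is a minimal sketch of pulling fields out of a stored HTML string by element id. Everything in it is an assumption: the helper extractFieldsById, the sample markup and the ids are made up and not taken from the real site, and the DEVONthink side (reading the stored HTML, writing custom metadata) is left out. It is regex-based, so it only copes with simple, non-nested elements; a proper HTML parser would be more robust.

```javascript
// Hypothetical helper: return the text of elements with the given ids.
// Only handles simple, non-nested elements with a quoted id attribute.
function extractFieldsById(html, ids) {
  const fields = {};
  for (const id of ids) {
    // Escape regex metacharacters in the id before building the pattern.
    const safeId = id.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    const pattern = new RegExp(
      "<([a-zA-Z][\\w-]*)[^>]*\\sid=[\"']" + safeId + "[\"'][^>]*>([\\s\\S]*?)</\\1>",
      "i"
    );
    const match = html.match(pattern);
    if (match) {
      // Strip any inner markup and tidy the whitespace.
      fields[id] = match[2].replace(/<[^>]*>/g, "").replace(/\s+/g, " ").trim();
    }
  }
  return fields;
}

// Made-up sample page; the ids are placeholders.
const samplePage =
  '<div><span id="task-owner">Frank</span>' +
  '<td id="task-status"> <b>Closed</b> </td></div>';

const extracted = extractFieldsById(samplePage, ["task-owner", "task-status", "missing"]);
console.log(extracted);
```

Ids that are not found are simply absent from the result, so a later pass can tell "not present on this page" apart from an empty value.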

The DEVONthink documentation doesn’t make it clear that some fields such as “source” are only available to certain classes of record type, hence my misunderstanding.

I don’t want to create and manage duplicate records, one HTML record and one PDF record. So storing the entire HTML source as a custom metadata field appears to be a possibility, but I wouldn’t have thought it ideal, so I’m looking for better solutions.

Is there a more appropriate field in a standard PDF record where I could store the HTML source? Do you have any suggestions?

chrillek · March 1, 2024, 12:03pm

If the PDF contains a text layer, you could try to use a script to access the record’s plainText property and extract the fields from there. May or may not work, depending on the organization of the plain text – it need not follow the original layout. But that’s easy to determine.
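
A minimal sketch of that idea, assuming the fields show up in the plain text as "Label: value" on a single line. That layout is a guess, as are the helper name extractLabelledFields and the sample text; the real text layer may be ordered quite differently. Getting the string out of the record's plainText property is left out here.

```javascript
// Hypothetical helper: find "Label: value" lines in a block of plain text.
function extractLabelledFields(plainText, labels) {
  const fields = {};
  for (const label of labels) {
    // Escape regex metacharacters in the label before building the pattern.
    const safeLabel = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // The "m" flag makes ^ and $ match at line boundaries;
    // [ \t]* tolerates padding around the label and the colon.
    const pattern = new RegExp("^[ \\t]*" + safeLabel + "[ \\t]*:[ \\t]*(.*)$", "m");
    const match = plainText.match(pattern);
    if (match) {
      fields[label] = match[1].trim();
    }
  }
  return fields;
}

// Made-up sample text standing in for a PDF's text layer.
const sampleText = "Task 1234\nOwner: Frank\n  Status :  Closed  \nNotes follow below";

const found = extractLabelledFields(sampleText, ["Owner", "Status", "Priority"]);
console.log(found);
```

Running this against a handful of real records is a quick way to see whether the text layer keeps label and value together at all.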

But you’ve already burdened the host once. I don’t think doing it another time is more problematic than the first time. And you could combine this approach with “2”, i.e. access each HTML, retrieve all fields and add them as metadata to the corresponding PDF. Then you’ll probably have too much data, but that’s better than missing data, isn’t it?

Why would that be “ideal”? You’d simply duplicate information, and in a not very accessible format, at that.

How is that simpler than retrieving the data from the original HTML on the website and adding them as metadata once?

It actually does. I found that by looking at the scripting dictionary.

Nope. PDF has only a very limited number of standardized metadata.

FiddleDiddle · March 1, 2024, 12:32pm

Thanks. I wasn’t aware the plainText property was available. I’ll investigate.

There are various rate limits that I have to comply with. Making multiple passes exceeding the monthly utilisation could see me temporarily blocked. So I need to be careful. But yes one more pass and extracting all the fields is better than repeated passes extracting only a few fields at a time. It’s just the effort to code for all 80+ fields when many don’t have convenient "id"s.

It would mean that long after the current tool is gone, I can go back and extract anything I was missing from the HTML. I’ll take the hit on it being a duplicate and not very accessible, because if I have it I can at least access it.

Because the source HTML isn’t great and working out how to extract every field is going to be a significant amount of work. I’d prefer to only spend that time extracting those fields if & when I decide I really need them.

I must be looking in the wrong place as this is all I see in the dictionary…

source (text) : The HTML/XML source of a record if available or the record converted to HTML if possible.

I read this as

the field called “source”
Its content type is “text”
It’s not “read only” so I can update
Its expected use is to
– contain “The HTML/XML source of a record if available” seems like a perfect match for my requirement
– contain “the record converted to HTML if possible.”

I’m keen to learn so if there is a different interpretation or a more detailed explanation somewhere else please share.

chrillek · March 1, 2024, 1:24pm

Only once, and then it’ll work for 10000 documents. Hopefully. And “the source HTML” is all you have, anyway. If you think about converting the PDF to HTML – don’t. That’ll probably not yield a better structured HTML than the original one.
In any case, a sample PDF and original HTML would help to figure out if/how to get at the data. As it stands, it’s all a bit foggy.

But there is no HTML/XML source available for this record. You’re kind of reading it backwards. source is meant to access the HTML/XML of an HTML/XML record. Not to add stuff to arbitrary records. And in the case of an HTML record, one can actually modify source, creating all kinds of havoc.
Since it’s possible to convert a PDF to HTML (at least in the one case I tried it with), the documentation seems in fact to imply that this conversion might happen on the spot, creating an HTML document in the source property. Which doesn’t happen, though. @cgrunenberg would have to comment on that.

FiddleDiddle · March 1, 2024, 2:16pm

I can see that as an alternative interpretation. But then I’d expect the field to be read only.

Which makes me think again that it should probably be read-only. But then I guess there are all sorts of strange people out there with strange use cases.

chrillek · March 1, 2024, 2:54pm

Why? Markdown has the plain text property, which is very useful for modifying MD documents programmatically. source does the same for HTML. I’d rather ask why we need two properties for essentially the same thing. And whether it’s reasonable to have a writable plain text in the case of PDFs, where modifying it can get the OCR layer and the visual out of sync.