Inject HTML back into a DTP tab?

I’ve extracted the HTML “source” from a DTP tab.

set theHTML to source of theTab

I’ve then manipulated theHTML to my requirements. I would now like to load theHTML back into theTab.

I can successfully save theHTML to a temporary file and then

set URL of theTab to "file://" & theFile

And then continue processing. But I would prefer to avoid the overhead of writing to a file, just to load it into a tab, and then deleting the file.

I’d like to

set source of theTab to theHTML

But source is a read-only property. Is there a clever way around this other than altering the HTML in the tab via JS?

Thanks

S

Is there a clever way around this other than altering the HTML in the tab via JS?

I don’t think so. With JS, you could try to set document.documentElement.innerHTML to your HTML. But that might actually not work for security reasons.
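A minimal sketch of that idea, for what it’s worth. The helper name buildSetHtmlScript is hypothetical, and whether the tab accepts the replacement is untested here. The point is only that JSON.stringify can do the quoting when the HTML is embedded in a script string.

```javascript
// Hypothetical helper: builds the JavaScript source that replaces the page's
// markup. JSON.stringify turns the HTML into a safely quoted JS string literal,
// so quotes and backslashes inside the HTML need no manual escaping.
function buildSetHtmlScript(html) {
  return "document.documentElement.innerHTML = " + JSON.stringify(html) + ";";
}

// The resulting string is what would be handed to "do JavaScript" for the tab
// (assumption: the command runs the string in the context of the loaded page).
const script = buildSetHtmlScript("<body><a href='x'>It's a \"test\"</a></body>");
```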

What are you trying to accomplish with these steps?

Yes, what are you actually trying to accomplish?

Highest level:
Trying to extract a large number of web pages hosted in an Oracle/APEX database before it is decommissioned for good. I want to save these in DTP as PDFs. In principle this is working and working well.

For bonus points:
Each web page contains a number of links to other documents in the system. The URLs point back to the Oracle/APEX system. These are nicely captured in the PDF when you “set thePDF to data of theTab”. However the links point back to the old Oracle/APEX system.

I have been trying to work out how to update the annotations in the PDF. However, I realised that if I update the HTML within theTab [ I tested using setAttribute("href", "x-devonthink-item://3AE8B440-1F70-4E33-9E42-0C9591298498"); ], the PDF is then rendered with the correct link. But I need to do this for each link in each document.

This is very messy POC code:

	set theHTML to source of theTab
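	# Search the HTML for all hrefs that appear to include a documentID
	# [^']* keeps each match inside one href value (a greedy .* could span several hrefs on the same line)
	set thePattern to "href='[^']*&id=[[:digit:]]{4,}-[[:digit:]]{4,}[^']*'"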
	set theMatches to regex search theHTML search pattern thePattern
	
	# For each matching "href"
	repeat with theMatch in theMatches
		# Split the HTML section by "&"
		set theAResults to split string theMatch using delimiters {"&"}
		repeat with theAResult in theAResults
			# For each "&" section split it into KVP e.g. id=xyz
			set theBResults to split string theAResult using delimiters {"="}
			if first item of theBResults is "id" then
				#We've found the KVP we're looking for 
				set documentID to second item of theBResults
				#Clean the document ID
				set documentID to oralib's trimText(documentID, {"'"})
				if oralib's isValidDocumentID(msgPrefix, documentID) then
					# get the x-devonthink-item:// URL for the associated document 
					set documentURL to oralib's getDocumentIdReferenceURL(msgPrefix, documentID)
					# construct a replacement href 
					set replacementURL to "href='" & documentURL & "'"
					# replace the original href with the new one
					set theHTML to toollib's stringReplace(theHTML, theMatch, replacementURL)
					# Or potentially generate a JS command to perform the replacement in theTab instead
				end if
			end if
		end repeat
	end repeat
	my writeTextToFile(theHTML, theSourceFile & "_updated", true)

I don’t know in advance which documentIDs I’m going to find in a document, and I need to search DTP for the x-devonthink-item link for each one. Switching back and forth between AppleScript and JS as I work through the document is beyond my skills at the moment.

I don’t like using regex to manipulate theHTML, but I’ve struggled with XPath between AppleScript/JS and struggled to get NSXMLDocument to process the pages successfully.

Off hand, I don’t see a way to avoid the temp HTML files.

This should be possible via the do JavaScript command.

What about creating a new record via…

set theRecord to create record with {name:"Dummy", source:theHTML, type:html} in (current group)

…and then converting this record to the desired PDF format and finally deleting the dummy record?

convert record theRecord to PDF document
delete record theRecord

This is great, thanks.

I’ll investigate both approaches.

Thanks to you all.

Also, instead of switching to and fro between AppleScript and JavaScript, you could write everything in JS.

What might (!) work:

  • while the html is loaded in the tab
  • use doJavaScript to find all a elements whose href points to your Oracle/APEX database. Best to do this with document.querySelectorAll("a[href...]") and an attribute selector that matches the actual href values.
  • return these hrefs to your main script as a JSON.stringify()d array, where
  • you convert the string back to an array (JSON.parse()) and build an object whose keys are these hrefs and their values are the item links
  • then create (still in this main script) a JavaScript script that loops over this object, replacing each of its keys in the document’s a elements with its value.
  • finally, send this script to the document in the tab using doJavaScript again
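The steps above might be sketched roughly as follows. This is an untested outline, not a finished script: the three function names and the lookupItemLink callback are hypothetical, and it assumes the value of the last expression in the script string is what comes back from the tab. Each build… function returns a string to send to the tab; buildLinkMap runs in the main script.

```javascript
// Step 1: script to run in the tab. It returns the matching hrefs as a JSON string.
// `prefix` is whatever the old system's links start with (an assumption here,
// and assumed to contain no control characters).
function buildCollectScript(prefix) {
  const selector = `a[href^=${JSON.stringify(prefix)}]`;
  return `JSON.stringify(Array.from(
    document.querySelectorAll(${JSON.stringify(selector)}),
    a => a.getAttribute("href")))`;
}

// Step 2: in the main script, turn the hrefs into { oldHref: itemLink }.
// `lookupItemLink` is a hypothetical callback that asks DEVONthink for the
// x-devonthink-item:// URL; hrefs it cannot resolve are left out of the map.
function buildLinkMap(hrefsJson, lookupItemLink) {
  const map = {};
  for (const href of JSON.parse(hrefsJson)) {
    const itemLink = lookupItemLink(href);
    if (itemLink) map[href] = itemLink;
  }
  return map;
}

// Step 3: script that applies the map inside the tab and returns how many
// links were rewritten.
function buildReplaceScript(linkMap) {
  return `(() => {
    const map = ${JSON.stringify(linkMap)};
    let count = 0;
    for (const a of document.querySelectorAll("a[href]")) {
      const href = a.getAttribute("href");
      if (Object.prototype.hasOwnProperty.call(map, href)) {
        a.setAttribute("href", map[href]);
        count++;
      }
    }
    return count;
  })()`;
}
```

Embedding the map with JSON.stringify means the item links never have to be quoted by hand.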

I’m limited to my iPhone now, so can’t write longish code.

Thanks everyone for your suggestions!

Just in case it’s of any use to somebody in the future, my solution was to use

		set theHTML to source of theTab
		set theLinks to get links of theHTML

To get all of the HTML links in the tab. Then I filter the different types of links and extract the documentID. I use the documentID to search DTP to retrieve the x-devonthink-item:// URL.
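For anyone doing the filtering step in JavaScript instead, a small sketch of pulling the documentID out of each link. The pattern mirrors the one in the POC code earlier in the thread (an id parameter with at least four digits on either side of a hyphen). The function names and the example URL shape in the comments are assumptions, not taken from the actual system.

```javascript
// Matches an "id" query parameter shaped like 1234-5678 (at least four digits
// on each side), whether it follows "?", "&" or the HTML-escaped "&amp;".
const ID_PATTERN = /(?:[?&]|&amp;)id=(\d{4,}-\d{4,})/;

// Returns the document ID found in a link, or null if there is none.
function extractDocumentId(link) {
  const match = ID_PATTERN.exec(link);
  return match ? match[1] : null;
}

// Keeps only the links that carry a document ID, paired with that ID.
// Example input shape (assumed): "https://old.example/f?p=100:12&id=1234-56789"
function linksWithIds(links) {
  return links
    .map(link => ({ link, id: extractDocumentId(link) }))
    .filter(entry => entry.id !== null);
}
```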

I then use the handler below to update the URL in the tab.

to updateLinkInTab(msgPrefix, theTab, oldLink, newLink)
	tell application id "DNtp"
		do JavaScript ("var els = document.querySelectorAll(\"a[href^='" & oldLink & "']\");
			for (var i = 0, l = els.length; i < l; i++) {
				var el = els[i];
				el.href = '" & newLink & "';
			}") in theTab
	end tell
end updateLinkInTab

This approach avoids needing to extract the XPath for each href and then safely escaping the XPath string to pass it back into the “do JavaScript” command.

After that it’s the usual

			set thePDF to PDF of theTab
			set data of theRecord to thePDF

I’m sure in time I’ll find an issue with this code but so far I’ve processed 250 documents with it and it seems to be working OK.

Thanks again for all the ideas and suggestions.

Why do you quote the new link twice in JavaScript?

Almost certainly down to my lack of familiarity with the languages involved.

I believe JS is expecting something like …

el.href = 'x-devonthink-item://309CD4F7-3F16-46E5-8580-E53477C15365?reveal=1';

So to build the JS instruction I’m concatenating the static and variable parts together. Is there a better way to do this? This approach has given me huge headaches with strings that need escaping, especially when multiple quotes of different types are required.

As I hinted at before, I’d use JavaScript all the way, not AppleScript at all. Then you could use template literals in JS, which let you interpolate values without concatenating quoted fragments.
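To illustrate, a sketch of how the script text from the updateLinkInTab handler might be built in JavaScript. The function name buildUpdateLinkScript is hypothetical. The values still have to end up as string literals inside the generated script, so the sketch lets JSON.stringify produce them rather than wrapping them in quotes by hand.

```javascript
// Builds the script text that rewrites every link starting with oldLink.
// JSON.stringify supplies the quotes and escapes for both values, so links
// containing quotes or backslashes cannot break the generated code.
function buildUpdateLinkScript(oldLink, newLink) {
  const selector = `a[href^=${JSON.stringify(oldLink)}]`;
  return `
    for (const el of document.querySelectorAll(${JSON.stringify(selector)})) {
      el.href = ${JSON.stringify(newLink)};
    }`;
}

// The returned string is what would be sent to the tab, e.g. for
// buildUpdateLinkScript("https://old.example/f?p=1", "x-devonthink-item://...").
```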