Split HTML file into multiple files (LexisNexis results)

This query is similar to my question in the following thread, although the result / purpose is somewhat different so I figured I would create a new thread.

LexisNexis is an online database which I use to search print-based news media. It allows users to download all or part of the results in a single document (I choose the HTML format since this presents the results most clearly). What would be great is a script that would take this HTML file once inside DevonThink and split it into several new items. This way I can utilise the power of DevonThink to analyse the results from LexisNexis, instead of having to wade through all the results on their rather cumbersome online interface.

Given the great solution edf came up with in the aforementioned thread, I’m pretty certain this is possible, but I have a fairly limited (read: non-existent) AppleScript knowledge so some assistance would be great!

I have uploaded a sample of the results file from LexisNexis here:
http://phishtank.info/einar_thorsen/files/pubGuardian.html

Each document starts with the following:

<!-- Hide XML section from browser
<DOC NUMBER=1>
<DOCFULL> -->

… where the number “1” changes for each document.

Each document ends with the following:

<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
<BR>

Each document then starts with the count of the given search, the copyright of where the text came from, the date of the original publication, the section where it was published, word length, headline and byline. Each of these is presented on a separate line and is followed by the body text. Example of code:

<BR>
<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 48 DOCUMENTS</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">Copyright 2005 Guardian Newspapers Limited &nbsp;<BR>
The Guardian (London) - Final Edition</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">May 5, 2005</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">SECTION: </SPAN><SPAN CLASS="c2">Guardian Life Pages, Pg. 15</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">LENGTH: </SPAN><SPAN CLASS="c2">878 words</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">HEADLINE: </SPAN><SPAN CLASS="c2">Online: Inside IT: An autonomous source of news: The government might at last have found a winning technological combination, writes Bobbie Johnson</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">BYLINE: </SPAN><SPAN CLASS="c2">Bobbie Johnson</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">BODY:<BR>
</SPAN><SPAN CLASS="c2"> </SPAN></P>
<P CLASS="c7"><SPAN CLASS="c2"> Dealing with truckloads of information is never an easy job, but when etc etc etc

The CSS code is as follows:

<STYLE TYPE="text/css"><!--
.c0 { text-align: center; }
.c1 { text-align: center; margin-top: 0em; margin-bottom: 0em; }
.c2 { font-family: Courier; font-size: 10pt; font-style: normal; font-weight: normal; color: #000000; text-decoration: none; }
.c3 { text-align: center; margin-left: 13%; margin-right: 13%; }
.c4 { text-align: left; }
.c5 { text-align: left; margin-top: 0em; margin-bottom: 0em; }
.c6 { font-family: Courier; font-size: 10pt; font-style: normal; font-weight: bold; color: #000000; text-decoration: none; }
.c7 { text-align: left; text-indent: 4%; margin-top: 1em; margin-bottom: 0em; }
--></STYLE>

Ideally the script would:

  1. Take the contents of the selected file in DevonThink.
  2. Create a new document containing the information between the tags mentioned first (<DOCFULL> and </DOCFULL>).
  3. Set the title of the new document to the “HEADLINE” tag.
  4. Set the creation date of the new document to the “date” tag (or if not possible then add this to the comments field).
  5. Add the “SECTION” tag to the end of the comment field.
  6. Add the CSS style sheet reference for correct markup to the start of the HTML code of the new document.
  7. Repeat 2-6 until the end of the selected file in DevonThink.

I realise this is perhaps a bit more of an ask than the previous request, but I would really appreciate someone giving me a hand with this. I am sure many other people use LexisNexis too so hopefully this will be of use to others as well! :)

I’ve spent today battling with AppleScript and I have created the following solution, which does all of what I required, apart from setting the date to the creation date of the new record…

-- Split LexisNexis results file(s) into separate database entries
-- Written by Einar Thorsen, 2006

-- Grabs the total number of documents which is used to limit the loop

on extract_number_of_documents(html_src)
	
	-- set start of specified document
	
	set xml_tag to "<META TOPIC="
	
	-- grab the total number of documents in file
	
	set xml_str to text from character (offset of xml_tag in html_src) to (length of html_src) of html_src
	set content_str to text from character ((offset of "DOCUMENTS=" in xml_str) + 11) of xml_str to (length of xml_str) of xml_str
	
	set result to text from character 1 to ((offset of "UPDATED" in content_str) - 3) of content_str
	
end extract_number_of_documents

-- Defines the area which constitutes a separate document

on extract_document_from_file(html_src, document_number)
	
	-- set start of specified document
	
	set xml_tag to "<DOC NUMBER=" & document_number & ">"
	
	-- grab the contents of the specified document
	
	set xml_str to text from character (offset of xml_tag in html_src) to (length of html_src) of html_src
	set content_str to text from character ((offset of "<DIV CLASS=\"c3\"><P CLASS=\"c1\"><SPAN CLASS=\"c2\">" in xml_str) + 0) of xml_str to (length of xml_str) of xml_str
	
	set result to text from character 1 to ((offset of "</DOC>" in content_str) - 1) of content_str
	
end extract_document_from_file

-- Extracts specified information from the document

on extract_document_info_from_file(html_src, document_number, xml_tag_name, xml_offset)
	
	-- set start of specified document
	
	set xml_tag to "<DOC NUMBER=" & document_number & ">"
	
	-- grab the information of the specified document
	
	set xml_str to text from character (offset of xml_tag in html_src) to (length of html_src) of html_src
	set content_str to text from character ((offset of xml_tag_name in xml_str) + xml_offset) of xml_str to (length of xml_str) of xml_str
	
	set result to text from character 1 to ((offset of "</SPAN>" in content_str) - 1) of content_str
	
end extract_document_info_from_file

-- Extracts the date from the document selection

on extract_document_date_from_document(news_document)
	
	-- set start of specified document
	
	set xml_tag to "<DIV CLASS=\"c3\">"
	
	-- grab the date of the specified document
	
	set xml_str to text from character (offset of xml_tag in news_document) to (length of news_document) of news_document
	set content_str to text from character ((offset of "</SPAN>" in xml_str) + 71) of xml_str to (length of xml_str) of xml_str
	
	set result to text from character 1 to ((offset of "</SPAN>" in content_str) - 1) of content_str
	
end extract_document_date_from_document

-- Start of script cycle

tell application "DEVONthink Pro"
	show progress indicator "Starting process..."
	
	-- set the style sheet, required at start of each item to display the information properly
	
	set css_header to "<STYLE TYPE='text/css'><!-- 
					.c0 { text-align: center; } 
					.c1 { text-align: center; margin-top: 0em; margin-bottom: 0em; } 
					.c2 { font-family: Courier; font-size: 10pt; font-style: normal; font-weight: normal; color: #000000; text-decoration: none; } 
					.c3 { text-align: center; margin-left: 13%; margin-right: 13%; } 
					.c4 { text-align: left; } 
					.c5 { text-align: left; margin-top: 0em; margin-bottom: 0em; } 
					.c6 { font-family: Courier; font-size: 10pt; font-style: normal; font-weight: bold; color: #000000; text-decoration: none; } 
					.c7 { text-align: left; text-indent: 4%; margin-top: 1em; margin-bottom: 0em; } 
					--></STYLE>"
	
	-- set chosen files to process
	
	set rec_list to the selection
	if rec_list is {} then error "Please select a captured web page..."
	hide progress indicator
end tell

repeat with rec in rec_list
	
	tell application "DEVONthink Pro"
		
		set html_str to source of rec
		
	end tell
	
	set total_documents to extract_number_of_documents(html_str)
	
	repeat with i from 1 to total_documents
		
		tell application "DEVONthink Pro"
			hide progress indicator
			show progress indicator "Extracting document " & i & " of " & total_documents
		end tell
		
		set news_document to extract_document_from_file(html_str, i)
		set news_title to extract_document_info_from_file(html_str, i, "HEADLINE", 34)
		set news_section to extract_document_info_from_file(html_str, i, "SECTION", 33)
		set news_byline to extract_document_info_from_file(html_str, i, "BYLINE", 32)
		
		set news_date to extract_document_date_from_document(news_document)
		
		tell application "DEVONthink Pro"
			create record with {name:news_title, type:html, source:(css_header & news_document), comment:(news_date & " | " & news_section & " | By: " & news_byline)}
			
			hide progress indicator
		end tell
		
	end repeat
	
end repeat

I have tried this on the sample file I uploaded and it works through all of the 48 documents, sets the title correctly and the comment field as: date | section | byline.

If anyone could give me a hand with the final part of setting the creation date to “news_date” as above, that would be fantastic. The solution suggested by edf was something along these lines:

on fix_bbc_date(bbc_date)
	-- Converts YYYY/MM/DD to DD/MM/YYYY 
	set text item delimiters to " "
	-- date_time is { date, time } 
	set date_time to text items of bbc_date
	set text item delimiters to "/"
	-- ymd (YearMonthDay) is {year, month, day} 
	set ymd to text items of item 1 of date_time
	-- construct date string 'DD/MM/YYY hhhh:mm:ss' 
	set result to item 3 of ymd & "/" & item 2 of ymd & "/" & item 1 of ymd & " " & item 2 of date_time
end fix_bbc_date

The LexisNexis records follow quite a different format though, e.g. “May 5, 2005”, and I don’t know how I would go about creating a DD/MM/YYYY version of that.

Given that this is my first attempt at a serious AppleScript, I would welcome any help to optimize (it does run slowly) or debug it (though it appears to run fine).

If anyone else uses LexisNexis, then please let me know if you found this useful or if you have other ways of handling information from this source.

:)

heya einar,

I wish I had LexisNexis access … haven’t had that since school well over a decade ago.

Been busy doing my own AppleScript stuff this week, writing stuff to get our proprietary (binary) data loaded as Sheets in DT.

Anyways for the date, it is simple enough. Try this in Script Editor:


set date_str to "May 5, 2005"
set result to date date_str

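AppleScript can make a date out of an English date string no problem, provided the Mac's regional date settings are English-style (the string is parsed according to the current user's settings); it is when there are delimiters like - and / that the programmer must get involved to disambiguate. That means your code can be changed to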


set the_date to date news_date
tell application "DEVONthink Pro"
         create record with {name:news_title, type:html, source:(css_header & news_document), date:the_date, comment:( news_section & " | By: " & news_byline)}
         
         hide progress indicator
end tell
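
If the regional settings ever get in the way, the date can also be built by hand. What follows is only a minimal sketch: the handler name date_from_english is made up for illustration, it assumes the text always looks like "May 5, 2005" (English month name, day followed by a comma, four-digit year, no leading spaces), and setting the month from a number assumes a reasonably recent AppleScript.

```applescript
-- Hypothetical helper: build a date from text like "May 5, 2005"
-- without relying on the system's date format settings
on date_from_english(date_str)
	set month_names to {"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"}
	-- split on spaces, restoring the delimiters afterwards
	set old_delims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to {" "}
	set parts to text items of date_str
	set AppleScript's text item delimiters to old_delims
	-- parts is now e.g. {"May", "5,", "2005"}
	set month_num to 0
	repeat with i from 1 to 12
		if item i of month_names is (item 1 of parts) then
			set month_num to i
			exit repeat
		end if
	end repeat
	if month_num is 0 then error "Unrecognised month in: " & date_str
	-- drop the comma after the day number
	set day_text to item 2 of parts
	if day_text ends with "," then set day_text to text 1 thru -2 of day_text
	set the_date to current date
	-- set the day to 1 first so changing the month cannot overflow
	set day of the_date to 1
	set year of the_date to (item 3 of parts) as integer
	set month of the_date to month_num
	set day of the_date to day_text as integer
	set time of the_date to 0
	return the_date
end date_from_english
```

In the script above, `set the_date to date news_date` would then become `set the_date to date_from_english(news_date)`.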

Regarding your AppleScript code, performance-wise I can’t see anything obvious to fix. AppleScript is not good at parsing stuff like XML, so it’s going to be tedious no matter what.

Style-wise, you might want to extract common code into a handler you can call. This gives you more building blocks to use later; you can use the handler in other scripts, or use it in future additions to this script.

The one that leaps immediately to mind is


set xml_str to text from character (offset of xml_tag in html_src) to (length of html_src) of html_src
set content_str to text from character ((offset of "DOCUMENTS=" in xml_str) + 11) of xml_str to (length of xml_str) of xml_str
set result to text from character 1 to ((offset of "UPDATED" in content_str) - 3) of content_str

which is repeated in


set xml_str to text from character (offset of xml_tag in html_src) to (length of html_src) of html_src
set content_str to text from character ((offset of "<DIV CLASS=\"c3\"><P CLASS=\"c1\"><SPAN CLASS=\"c2\">" in xml_str) + 0) of xml_str to (length of xml_str) of xml_str
set result to text from character 1 to ((offset of "</DOC>" in content_str) - 1) of content_str

Now I’m not sure how many calls like this you are going to have, but there is obviously a lot of commonality here.

You have an xml_tag which you look for in html_src, you have some starting text after the tag, an offset from the starting text at which you start capturing, some ending text that delimits (marks the end of) the information you are interested in, and an offset from the ending text where you stop capturing.

The first offset (always?) moves forward from the starting text, and the second (always?) moves back from the ending text, so a very rough stab at a reusable routine, taking both as positive character counts, is


on get_xml_text(xml_tag, html_src, start_text, start_skip_chars, end_text, end_backtrack_chars)
	set xml_str to text from character (offset of xml_tag in html_src) to (length of html_src) of html_src
	set content_str to text from character ((offset of start_text in xml_str) + start_skip_chars) of xml_str to (length of xml_str) of xml_str
	set result to text from character 1 to ((offset of end_text in content_str) - end_backtrack_chars) of content_str
end get_xml_text

This gives you the ability to define your other routines as


on extract_number_of_documents(html_src)
	-- grab the total number of documents in file
	-- format is <META TOPIC= ... DOCUMENTS=... UPDATED
	set start_text to "DOCUMENTS="
	-- skip the label plus one more character; stop 3 characters before UPDATED
	return get_xml_text("<META TOPIC=", html_src, start_text, (length of start_text) + 1, "UPDATED", 3)
end extract_number_of_documents

on extract_document_from_file(html_src, document_number)
	-- Defines the area which constitutes a separate document
	-- format is <DOC NUMBER=${DOC_NUMBER}> ... <DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2"> ... </DOC>
	return get_xml_text("<DOC NUMBER=" & document_number & ">", html_src, "<DIV CLASS=\"c3\"><P CLASS=\"c1\"><SPAN CLASS=\"c2\">", 0, "</DOC>", 1)
end extract_document_from_file

Really what you want to be able to do is to generalize the patterns you are looking for in the text (e.g. open:close:) and push those into parameters, then deal with the weird exceptions afterwards. For example, a routine like “give me the text between OpenTag and CloseTag” (which generalizes to “give me everything between TextA and TextB”) would probably be very useful in what you are doing. This turns the problem from “find substring A then find substring B then find substring C” to “get substring between A and B, then find substring C in it”.
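
A rough sketch of such a "between TextA and TextB" routine, done with text item delimiters rather than offsets. The handler name text_between is invented for illustration; it returns "" when TextA is missing, and everything after TextA when TextB is missing.

```applescript
-- Hypothetical helper: text between the first text_a and the next text_b
on text_between(src_str, text_a, text_b)
	set old_delims to AppleScript's text item delimiters
	set found_str to ""
	try
		-- split on text_a; item 1 is whatever came before it
		set AppleScript's text item delimiters to {text_a}
		set parts to text items of src_str
		if (count of parts) > 1 then
			-- rejoin everything after the first text_a
			set rest_str to (items 2 thru -1 of parts) as text
			-- keep only what precedes the first text_b
			set AppleScript's text item delimiters to {text_b}
			set found_str to text item 1 of rest_str
		end if
	end try
	-- always put the delimiters back
	set AppleScript's text item delimiters to old_delims
	return found_str
end text_between
```

For instance, `text_between(html_src, "<DOCFULL>", "</DOCFULL>")` would hand back the body of the first document, which can then be searched for the individual fields.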

Aside from that, it looks good. As I said, there’s not a whole lot you can do style-wise with AppleScript. I can’t believe I had to code my own split() and dictionary/hash structure… might as well be using C :)

LexisNexis is great, but with this new script I’m able to apply the full power of DevonThink to my results… that makes it a million times more useful!! The combination is so powerful that I’m more than happy to persevere with this shoddy AppleScript stuff. I once had to program a puzzle game engine with randomiser functions in Lingo - I still resent it!

Anyway, your date suggestion worked a treat and this is now incorporated in the script. I also hit a bit of a problem when expanding my results. It seems not all the articles use the same markup, so I’ve had to put in some basic error checking and it now appears to work… I’ve not tried to combine the routines yet, just in case I hit more inaccuracies like this.

Below is the updated code if anyone is interested… I will post any changes if they are required, and certainly an update if I combine the routines.

-- Split LexisNexis results file(s) into separate database entries
-- Written by Einar Thorsen, 2006

-- Start of script cycle
tell application "DEVONthink Pro"
	show progress indicator "Starting process..."
	-- set the style sheet, required at start of each item to display the information properly
	set css_header to "<STYLE TYPE='text/css'><!-- 
					.c0 { text-align: center; } 
					.c1 { text-align: center; margin-top: 0em; margin-bottom: 0em; } 
					.c2 { font-family: Courier; font-size: 10pt; font-style: normal; font-weight: normal; color: #000000; text-decoration: none; } 
					.c3 { text-align: center; margin-left: 13%; margin-right: 13%; } 
					.c4 { text-align: left; } 
					.c5 { text-align: left; margin-top: 0em; margin-bottom: 0em; } 
					.c6 { font-family: Courier; font-size: 10pt; font-style: normal; font-weight: bold; color: #000000; text-decoration: none; } 
					.c7 { text-align: left; text-indent: 4%; margin-top: 1em; margin-bottom: 0em; } 
					--></STYLE>"
	-- set chosen files to process
	set rec_list to the selection
	if rec_list is {} then error "Please select a captured web page..."
	hide progress indicator
end tell

repeat with rec in rec_list
	tell application "DEVONthink Pro"
		set html_str to source of rec
	end tell
	
	set total_documents to extract_number_of_documents(html_str)
	
	repeat with i from 1 to total_documents
		
		tell application "DEVONthink Pro"
			hide progress indicator
			show progress indicator "Extracting document " & i & " of " & total_documents
		end tell
		
		set news_document to extract_document_from_file(html_str, i)
		set news_title to extract_document_info_from_file(html_str, i, "HEADLINE", 34)
		set news_section to extract_document_info_from_file(html_str, i, "SECTION", 33)
		set news_byline to extract_document_info_from_file(html_str, i, "BYLINE", 32)
		
		set news_date to extract_document_date_from_document(news_document)
		
		set the_date to date news_date
		
		tell application "DEVONthink Pro"
			create record with {name:news_title, type:html, source:(css_header & news_document), date:the_date, comment:(news_section & " | By: " & news_byline)}
			hide progress indicator
		end tell
		
	end repeat
	
end repeat

-- Start of routines

-- Grabs the total number of documents which is used to limit the loop
on extract_number_of_documents(html_src)
	-- set start of specified document
	set xml_tag to "<META TOPIC="
	-- grab the total number of documents in file
	set xml_str to text from character (offset of xml_tag in html_src) to (length of html_src) of html_src
	set content_str to text from character ((offset of "DOCUMENTS=" in xml_str) + 11) of xml_str to (length of xml_str) of xml_str
	set result to text from character 1 to ((offset of "UPDATED" in content_str) - 3) of content_str
end extract_number_of_documents

-- Defines the area which constitutes a separate document
on extract_document_from_file(html_src, document_number)
	-- set start of specified document
	set xml_tag to "<DOC NUMBER=" & document_number & ">"
	-- grab the contents of the specified document
	set xml_str to text from character (offset of xml_tag in html_src) to (length of html_src) of html_src
	try
		set content_str to text from character ((offset of "<DIV CLASS=\"c3\"><P CLASS=\"c1\"><SPAN CLASS=\"c2\">" in xml_str) + 0) of xml_str to (length of xml_str) of xml_str
	on error
		set content_str to text from character ((offset of "<DIV CLASS=\"c8\"><P CLASS=\"c1\"><SPAN CLASS=\"c2\">" in xml_str) + 0) of xml_str to (length of xml_str) of xml_str
	end try
	set result to text from character 1 to ((offset of "</DOC>" in content_str) - 1) of content_str
end extract_document_from_file

-- Extracts specified information from the document
on extract_document_info_from_file(html_src, document_number, xml_tag_name, xml_offset)
	-- set start of specified document
	set xml_tag to "<DOC NUMBER=" & document_number & ">"
	-- grab the information of the specified document
	set xml_str to text from character (offset of xml_tag in html_src) to (length of html_src) of html_src
	set content_str to text from character ((offset of xml_tag_name in xml_str) + xml_offset) of xml_str to (length of xml_str) of xml_str
	set result to text from character 1 to ((offset of "</SPAN>" in content_str) - 1) of content_str
end extract_document_info_from_file

-- Extracts the date from the document selection
on extract_document_date_from_document(news_document)
	-- set start of specified document
	set xml_tag to "<DIV CLASS=\"c3\">"
	-- grab the date of the specified document
	try
		set xml_str to text from character (offset of xml_tag in news_document) to (length of news_document) of news_document
	on error
		set xml_tag to "<DIV CLASS=\"c8\">"
		set xml_str to text from character (offset of xml_tag in news_document) to (length of news_document) of news_document
	end try
	set content_str to text from character ((offset of "</SPAN>" in xml_str) + 71) of xml_str to (length of xml_str) of xml_str
	set result to text from character 1 to ((offset of "</SPAN>" in content_str) - 1) of content_str
end extract_document_date_from_document

New glitch… given that not all the news items contain all the tags (e.g. Section and Byline), is there a way I can run a check on the selection and if it contains “SECTION” then run the routine to collect the section details? The script clearly needs some error handling too, but for now this would help it not place the author of the previous news item if the current one does not contain one.

It sounds like you need something like this:


Get all text between <DOCFULL> and </DOCFULL>
For every <DIV> tag in text:
    Get text between <DIV> and </DIV>
    For every <SPAN> tag in text:
        Save text between <SPAN> and </SPAN>
    If number of <SPAN> tags != 2 then discard saved text
    Else save text of first <SPAN> tag in "Key" list, save text of second <SPAN> tag in "Value" list.

This would give you a list of Keys and Values, such that item 1 of Keys might be “SECTION:” and item 1 in Values would be “Guardian Life Pages, Pg. 15”.

How to code this? Let’s make two helper routines:


-- this is rough and ready as I am lazy; bear with!

-- returns the text between the first tag1 and the next tag2,
-- or "" if either tag is missing (or nothing sits between them)
on get_text_between_tags(src_str, tag1, tag2)
    set start_idx to offset of tag1 in src_str
    if start_idx is 0 then return ""
    set start_idx to start_idx + (length of tag1)
    if start_idx > (length of src_str) then return ""
    set src_str to text start_idx thru -1 of src_str
    set end_idx to offset of tag2 in src_str
    if end_idx < 2 then return ""
    return text 1 thru (end_idx - 1) of src_str
end get_text_between_tags

on strip_tags(src_str)
    set result_str to ""
    repeat
        set idx to offset of "<" in src_str
        if idx is 0 then
            -- no < in string
            set result_str to result_str & src_str
            exit repeat
        else if idx > 1 then
            -- save all text up to the <
            set result_str to result_str & (text 1 thru (idx - 1) of src_str)
        end if

        -- set string to start of tag
        set src_str to text idx thru -1 of src_str

        -- find end of tag
        set idx to offset of ">" in src_str
        -- no > in string (malformed tag; could append it to result_str),
        -- or the tag is the last thing in the string
        if idx is 0 or idx is (length of src_str) then exit repeat

        -- set start of string to character after end of tag
        set src_str to text (idx + 1) thru -1 of src_str
    end repeat
    return result_str
end strip_tags

With these written, the algorithm becomes a bit more manageable:


-- more rough code!

on handle_div(src_str)
    set open_tag to "<SPAN"
    set close_tag to "</SPAN>"
    set span1 to get_text_between_tags(src_str, open_tag, close_tag)
    -- if no span tag then return
    if span1 is "" then return {}

    -- semi-redundant advancement past first SPAN tag
    set next_idx to (offset of open_tag in src_str) + (length of open_tag) + (length of span1) + (length of close_tag)
    if next_idx > (length of src_str) then return {}
    set src_str to text next_idx thru -1 of src_str

    -- get second span tag
    set span2 to get_text_between_tags(src_str, open_tag, close_tag)

    -- if no span tag then return
    if span2 is "" then return {}

    -- super good! return span1 as key and span2 as value.
    -- span1 and span2 still begin with the rest of the opening tag
    -- (e.g. CLASS="c6">), so put "<SPAN" back and let strip_tags remove it
    return {strip_tags(open_tag & span1), strip_tags(open_tag & span2)}
end handle_div

on handle_document(src_str)
    set keys to {}
    set vals to {}
    set open_tag to "<DIV"
    set close_tag to "</DIV>"

    repeat
        set div_str to get_text_between_tags(src_str, open_tag, close_tag)
        if div_str is "" then exit repeat

        set keyval to handle_div(div_str)
        if keyval is not {} then
            -- save key
            set end of keys to item 1 of keyval
            -- save value
            set end of vals to item 2 of keyval
        end if

        -- advance past DIV block
        -- this is a bit lame: we are calculating offset of
        -- DIV a second time
        set next_idx to (offset of open_tag in src_str) + (length of open_tag) + (length of div_str) + (length of close_tag)
        if next_idx > (length of src_str) then exit repeat
        set src_str to text next_idx thru -1 of src_str
    end repeat

    -- do something with keys and vals; here they are just handed back
    return {keys, vals}
end handle_document

on handle_page(src_str)
    set open_tag to "<DOCFULL>"
    set close_tag to "</DOCFULL>"
    repeat
        set doc_str to get_text_between_tags(src_str, open_tag, close_tag)
        -- was doc_str found?
        if doc_str is "" then exit repeat

        -- is doc_str big enough for DIVSPANSPAN?
        if (length of doc_str) > 37 then handle_document(doc_str)

        -- advance past DOC
        -- again, this is lame: we are calculating offset of
        -- DOCFULL a second time
        set next_idx to (offset of open_tag in src_str) + (length of open_tag) + (length of doc_str) + (length of close_tag)
        if next_idx > (length of src_str) then exit repeat
        set src_str to text next_idx thru -1 of src_str
    end repeat
end handle_page

Sure, it’s not ideal, but it should get the job done.

In handle_document, you will have a list of keys and a matching list of values. You can look up a value by doing this:


on lookup_key(key, key_list, value_list, default_value)
	-- Lookup key in key:value mapping represented by 
	-- key_list and value_list
	if key is "" then return default_value
	repeat with i from 1 to count of key_list
		if key = item i of key_list then
			return item i of value_list
		end if
	end repeat
	
	return default_value
end lookup_key

set section_str to lookup_key("SECTION: ", keys, vals, "")
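
For the script as it stands, a lighter-weight way to cope with a missing SECTION or BYLINE is to look for the label inside the single extracted document (news_document) rather than from the document's start to the end of the whole file, and to return "" when it is not there. This is just a sketch: the handler name extract_optional_info is invented, and it assumes the value always sits in the next SPAN with CLASS="c2", as in the sample.

```applescript
-- Hypothetical helper: value following a label such as "SECTION: " within ONE
-- document, or "" when the label (or the expected markup) is missing
on extract_optional_info(doc_str, label_str)
	-- assumption: values always sit in a SPAN with CLASS="c2", as in the sample
	set value_open to "<SPAN CLASS=\"c2\">"
	set value_close to "</SPAN>"
	-- match case exactly so "section: " in the body text is not picked up
	considering case
		if doc_str does not contain label_str then return ""
		set rest_str to text (offset of label_str in doc_str) thru -1 of doc_str
		set open_idx to offset of value_open in rest_str
		if open_idx is 0 then return ""
		set value_start to open_idx + (length of value_open)
		if value_start > (length of rest_str) then return ""
		set rest_str to text value_start thru -1 of rest_str
		set close_idx to offset of value_close in rest_str
	end considering
	-- close tag missing, or nothing in front of it
	if close_idx < 2 then return ""
	return text 1 thru (close_idx - 1) of rest_str
end extract_optional_info
```

Usage would be along the lines of `set news_byline to extract_optional_info(news_document, "BYLINE: ")`, which comes back empty for an article without a byline.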

Anyways this is all a bit off the cuff but should give you an idea of how to progress.

Some things to fix:

  • advancing past the tags after calling get_text_between_tags. This routine can be modified to return {found_text, offset_of_end_of_tag2}.
  • Normalizing keys. Probably all whitespace should be removed.
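
Both fixes might look something like the following. Treat it as a sketch only: the handler names get_text_and_next_offset and normalize_key are made up, and the list of several delimiters (and the linefeed constant) assumes a fairly recent AppleScript.

```applescript
-- Hypothetical variant of get_text_between_tags: returns {found_text, next_idx},
-- where next_idx is the position just after tag2 in src_str,
-- or {"", 0} when either tag is missing
on get_text_and_next_offset(src_str, tag1, tag2)
	set start_idx to offset of tag1 in src_str
	if start_idx is 0 then return {"", 0}
	set start_idx to start_idx + (length of tag1)
	if start_idx > (length of src_str) then return {"", 0}
	set rest_str to text start_idx thru -1 of src_str
	set end_idx to offset of tag2 in rest_str
	if end_idx is 0 then return {"", 0}
	if end_idx is 1 then
		-- tags are adjacent: nothing between them
		set found_str to ""
	else
		set found_str to text 1 thru (end_idx - 1) of rest_str
	end if
	-- start_idx + end_idx - 1 is where tag2 begins in src_str
	return {found_str, start_idx + end_idx - 1 + (length of tag2)}
end get_text_and_next_offset

-- Hypothetical key normaliser: drops spaces, tabs and line breaks
on normalize_key(key_str)
	set old_delims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to {" ", tab, return, linefeed}
	set parts to text items of key_str
	set AppleScript's text item delimiters to ""
	set clean_str to parts as text
	set AppleScript's text item delimiters to old_delims
	return clean_str
end normalize_key
```

The caller still has to check that next_idx is not 0 and not past the end of the string before slicing from it.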

Happy scripting :)

Would XML Tools 2 Scripting Addition (@latenightsw.com) be helpful for what Einar’s trying to do?

It certainly would do the job, but using expat (or any DOM parser) is not for the faint of heart :)

I would recommend using the routines covered in their Applescript Utility Code for XML Handling:

http://www.latenightsw.com/freeware/XMLTools2/asUtilities.html

These wrap the scripting addition and greatly simplify the huge tree structure returned by the DOM parser. For example, getElements() could be used to get all DOCFULL elements; getElements() could be called on the XMLContents of each DOCFULL element to get each DIV element, and so on.

Thanks for the tips both of you! I’ve been snowed under with work and a house move so not had time to look into this. The original script helped me out in the short term, but I will work to incorporate your suggestions next time I need it and post any results to the forum here…

Again, thanks for taking the time to help me out!!