This query is similar to my question in the following thread, although the result / purpose is somewhat different so I figured I would create a new thread.
LexisNexis is an online database which I use to search print-based news media. It allows users to download all or part of the results in a single document (I choose the HTML format since this presents the results most clearly). What would be great is a script that would take this HTML file once inside DevonThink and split it into several new items, one per article. This way I can utilise the power of DevonThink to analyse the results from LexisNexis, instead of having to wade through all the results on their rather cumbersome online interface.
Given the great solution edf came up with in the aforementioned thread, I’m pretty certain this is possible, but my AppleScript knowledge is fairly limited (read: non-existent), so some assistance would be great!
I have uploaded a sample of the results file from LexisNexis here:
http://phishtank.info/einar_thorsen/files/pubGuardian.html
Each document starts with the following:
<!-- Hide XML section from browser
<DOC NUMBER=1>
<DOCFULL> -->
… where the number “1” changes for each document.
Each document ends with the following:
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
<BR>
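To make the splitting step concrete, here is a rough sketch in Python rather than AppleScript (which I don't know). It only covers cutting the export into pieces using the start and end markers quoted above. The name `split_documents` and the sample text are made up for illustration, and I am assuming the markers always appear in exactly this form, give or take whitespace.

```python
import re

# Pattern built from the start and end markers quoted above.
# Assumption: whitespace and letter case inside the comments may vary slightly.
DOC_PATTERN = re.compile(
    r"<!--\s*Hide XML section from browser\s*"
    r"<DOC NUMBER=(\d+)>\s*<DOCFULL>\s*-->"
    r"(.*?)"
    r"<!--\s*Hide XML section from browser\s*"
    r"</DOCFULL>\s*</DOC>\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def split_documents(html):
    """Return a list of (document number, inner HTML) pairs."""
    return [
        (int(match.group(1)), match.group(2).strip())
        for match in DOC_PATTERN.finditer(html)
    ]


# Tiny invented sample with two documents, just to exercise the pattern.
sample = """<!-- Hide XML section from browser
<DOC NUMBER=1>
<DOCFULL> -->
<BR>first article
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
<BR>
<!-- Hide XML section from browser
<DOC NUMBER=2>
<DOCFULL> -->
<BR>second article
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
<BR>
"""

docs = split_documents(sample)
for number, body in docs:
    print(number, body)
```

The non-greedy `(.*?)` is what stops each match at the first end marker, so one article is never merged with the next.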
Each document then starts with its position in the search results (e.g. “1 of 48 DOCUMENTS”), the copyright notice of the source, the date of original publication, the section where it was published, the word length, the headline and the byline. Each of these is presented on a separate line, and together they are followed by the body text. Example of code:
<BR>
<DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 48 DOCUMENTS</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">Copyright 2005 Guardian Newspapers Limited <BR>
The Guardian (London) - Final Edition</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">May 5, 2005</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">SECTION: </SPAN><SPAN CLASS="c2">Guardian Life Pages, Pg. 15</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">LENGTH: </SPAN><SPAN CLASS="c2">878 words</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">HEADLINE: </SPAN><SPAN CLASS="c2">Online: Inside IT: An autonomous source of news: The government might at last have found a winning technological combination, writes Bobbie Johnson</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">BYLINE: </SPAN><SPAN CLASS="c2">Bobbie Johnson</SPAN></P>
</DIV>
<BR>
<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">BODY:<BR>
</SPAN><SPAN CLASS="c2"> </SPAN></P>
<P CLASS="c7"><SPAN CLASS="c2"> Dealing with truckloads of information is never an easy job, but when etc etc etc
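As a sketch of how the headline, section and date might be pulled out of one article, here is some Python that reads the markup shown above. It assumes the labels always sit in a `c6` span followed directly by a `c2` span, and that the date is written as English month, day, year (anything after the year in that span is ignored). The function names `labelled_field` and `extract_date` are invented for this example.

```python
import re
from datetime import datetime
from html import unescape

# Assumption: the date span looks like "May 5, 2005", possibly with trailing text.
DATE_PATTERN = re.compile(
    r'<SPAN CLASS="c2">\s*([A-Z][a-z]+ \d{1,2}, \d{4})[^<]*</SPAN>'
)


def labelled_field(doc_html, label):
    """Return the text that follows a label such as HEADLINE, or None."""
    pattern = re.compile(
        r'<SPAN CLASS="c6">' + re.escape(label) + r':\s*</SPAN>\s*'
        r'<SPAN CLASS="c2">(.*?)</SPAN>',
        re.DOTALL | re.IGNORECASE,
    )
    match = pattern.search(doc_html)
    if match is None:
        return None
    text = re.sub(r"<[^>]+>", " ", match.group(1))  # drop nested tags such as <BR>
    return " ".join(unescape(text).split())        # collapse runs of whitespace


def extract_date(doc_html):
    """Return the publication date as a datetime, or None if not found."""
    match = DATE_PATTERN.search(doc_html)
    if match is None:
        return None
    try:
        # %B relies on English month names being the active locale.
        return datetime.strptime(match.group(1), "%B %d, %Y")
    except ValueError:
        return None


# Lines copied from the example markup above.
sample = '''
<DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">Copyright 2005 Guardian Newspapers Limited <BR>
The Guardian (London) - Final Edition</SPAN></P>
</DIV>
<DIV CLASS="c3"><P CLASS="c1"><SPAN CLASS="c2">May 5, 2005</SPAN></P>
</DIV>
<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">SECTION: </SPAN><SPAN CLASS="c2">Guardian Life Pages, Pg. 15</SPAN></P>
</DIV>
<DIV CLASS="c4"><P CLASS="c5"><SPAN CLASS="c6">HEADLINE: </SPAN><SPAN CLASS="c2">Online: Inside IT: An autonomous source of news: The government might at last have found a winning technological combination, writes Bobbie Johnson</SPAN></P>
</DIV>
'''

headline = labelled_field(sample, "HEADLINE")
section = labelled_field(sample, "SECTION")
published = extract_date(sample)
print(headline, section, published, sep="\n")
```

Regular expressions are fragile against real-world HTML, so this would only hold up as long as LexisNexis keeps producing exactly this layout.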
The CSS code is as follows:
<STYLE TYPE="text/css"><!--
.c0 { text-align: center; }
.c1 { text-align: center; margin-top: 0em; margin-bottom: 0em; }
.c2 { font-family: Courier; font-size: 10pt; font-style: normal; font-weight: normal; color: #000000; text-decoration: none; }
.c3 { text-align: center; margin-left: 13%; margin-right: 13%; }
.c4 { text-align: left; }
.c5 { text-align: left; margin-top: 0em; margin-bottom: 0em; }
.c6 { font-family: Courier; font-size: 10pt; font-style: normal; font-weight: bold; color: #000000; text-decoration: none; }
.c7 { text-align: left; text-indent: 4%; margin-top: 1em; margin-bottom: 0em; }
--></STYLE>
Ideally the script would:
1. Take the contents of the selected file in DevonThink.
2. Create a new document containing everything between the start and end markers described above.
3. Set the title of the new document to the text of the “HEADLINE” field.
4. Set the creation date of the new document to the publication date (or, if that is not possible, add the date to the comments field).
5. Add the text of the “SECTION” field to the end of the comment field.
6. Add the CSS style sheet shown above to the start of the HTML code of the new document, so the markup displays correctly.
7. Repeat steps 2-6 until the end of the selected file in DevonThink.
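For the style sheet and title steps, something along these lines might work as a starting point. It is Python, not AppleScript, and it deliberately stops short of DevonThink itself: it only builds a standalone HTML page per article and suggests a file name, so setting the creation date and comment inside DevonThink would still need to be done there. The helper names (`extract_style`, `build_page`, `safe_filename`) and the shortened sample strings are made up for illustration.

```python
import re
from html import escape

STYLE_PATTERN = re.compile(r"<STYLE\b.*?</STYLE>", re.DOTALL | re.IGNORECASE)


def extract_style(export_html):
    """Return the first <STYLE>...</STYLE> block of the export, or ''."""
    match = STYLE_PATTERN.search(export_html)
    return match.group(0) if match else ""


def build_page(title, style_block, body_html):
    """Wrap one article in a minimal page that carries the shared style sheet."""
    return (
        "<HTML>\n<HEAD>\n"
        f"<TITLE>{escape(title)}</TITLE>\n"
        f"{style_block}\n"
        "</HEAD>\n<BODY>\n"
        f"{body_html}\n"
        "</BODY>\n</HTML>\n"
    )


def safe_filename(title, max_length=80):
    """Turn a headline into a conservative file name (assumed convention)."""
    cleaned = re.sub(r"[^\w\- ]+", "", title).strip()
    cleaned = re.sub(r"\s+", "_", cleaned)
    return (cleaned[:max_length] or "untitled") + ".html"


# Shortened, invented stand-ins for the real export and one article.
export = """<STYLE TYPE="text/css"><!--
.c4 { text-align: left; }
--></STYLE>
<BR>rest of the export
"""
article = '<DIV CLASS="c4">article markup goes here</DIV>'

style = extract_style(export)
page = build_page("Online: Inside IT", style, article)
name = safe_filename("Online: Inside IT")
print(name)
print(page)
```

Writing each `page` to a file called `name` would give one HTML file per article, which could then be imported into DevonThink in the usual way.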
I realise this is perhaps a bit more of an ask than the previous request, but I would really appreciate someone giving me a hand with this. I am sure many other people use LexisNexis too, so hopefully this will be of use to others as well!