Massage Results for PostProcessing on PC

ajcowell · October 1, 2007, 3:43pm

This isn’t really a request or bug, in fact I’m not sure where it really needs to go! Moderator, please move to where you see fit.

I’m working on a project that requires me to search and harvest a collection of webpages about a particular topic. I love the flexibility that DevonAgent gives me in defining the search engines, even writing plugins for my own, etc. My initial workflow was something like this:

Perform Search Using Selected Plugins
Add All to Archive
Remove Obvious Mismatches (results not related to topic)
Export Results As TXT and HTML

The project is for a client that has a PC bias, and they were less than happy with my first harvest: the filenames of some results contained characters that are invalid on the PC, the line breaks were UNIX style, and the text encoding was Mac Roman. Hence I now post-process each result with a toolchain that goes something like this:

Use ‘A Better Finder Renamer’ to rename the files to HarvestXX.txt, where XX is incremented for each file. I tried a few alternatives here, including their ‘Safe for SMB’ option, but some characters still got through (like ‘?’, if I remember correctly) and some filenames were too long, so in the end I just renamed them all. The problem is that I lose the original file name, which includes some detail as to where the file came from, so I need to somehow put this into the file. Shouldn’t the full URL for a search result be accessible somehow, either inside the file or as a comment or something? It’s essential that we can get back to the original source (it doesn’t matter that the page may have changed since, e.g., the front page of CNN).

Next, run all the text files through a freeware application called TODOS.exe which converts the line breaks.

Finally, run an OS X command line:

find . -name '*.txt' -exec textutil -convert txt '{}' \;

to convert the text format to UTF8. Actually, I’ll probably start doing these last two steps in the opposite order so I can move from the TODOS.exe stage right into archiving and sending instead of bringing them back to the Mac side.
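For what it's worth, the last two steps could probably be folded into a single script that runs on the Mac side. Below is a minimal sketch in Python, assuming the exported files really are Mac Roman with LF or CR line breaks; the function names `to_pc_text` and `convert_folder` are made up for illustration and are not part of DEVONagent or any of the tools above.

```python
from pathlib import Path


def to_pc_text(raw: bytes, source_encoding: str = "mac_roman") -> bytes:
    """Decode Mac text, normalise line breaks to CRLF, and encode as UTF-8.

    The default source encoding is an assumption based on the export
    described above; pass a different codec name if the files differ.
    """
    text = raw.decode(source_encoding)
    # Collapse existing CRLF and bare CR to LF first so nothing gets doubled.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\n", "\r\n").encode("utf-8")


def convert_folder(folder: str) -> int:
    """Rewrite every .txt file under folder in place; return the file count."""
    count = 0
    for path in Path(folder).rglob("*.txt"):
        path.write_bytes(to_pc_text(path.read_bytes()))
        count += 1
    return count
```

Calling `convert_folder("Harvest")` on a copy of the export folder would rewrite the files in place. It should only be run once per file: a second pass would decode the new UTF-8 bytes as Mac Roman and garble any accented characters.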

While I’m still trying to sell this as an option, I’m coming under a lot of pressure to use a PC harvester. I’ve not been able to find any that match Agent. I ended up buying Copernic Agent Pro, only to find out that there is no way to save the results of searches to either one big file or individual files. Their support people said it is something they might consider for a future version.

So, apart from waffling on, I guess the questions I have are:

Have I missed anything in the toolchain? Are there any other potential pitfalls that I might hit using these post-processed text files on a PC?
How do I get access to the URL of the source document after I’ve exported? Can it be inserted into the file?
Are there any other tools out there that might make my task easier for this PC focused client? Note, this certainly won’t stop me using Agent for all my other harvesting tasks, it certainly is an excellent program and a bargain at the price.
Would I have fewer issues just providing the HTML versions of the source documents?

Any help at all on this matter would be appreciated! Thanks for reading.

cgrunenberg · October 2, 2007, 4:47pm

A future release could export this to the Finder comment, but in your case this would still be useless for the dark side without post-processing.

Fixing the filename and the text encoding should be sufficient.

That’s not yet possible, see above.

Not yet as the HTML export doesn’t include the URL.

Anyway, the easiest and best solution is probably to insert the URL to exported text files and to set the base URL of exported HTML files. A future release of DEVONagent will do this, just send me an email (cgrunenberg-at-devon-technologies.com) if you’re interested in a beta implementing this.

ajcowell · October 2, 2007, 4:56pm

Thanks for your quick reply. Just throwing the URL inside the document would definitely work, so yes, very interested in a beta that could include that.

I believe the text encoding and filenames are all that is needed to enable processing on the PC side, but I’ll let you know if anything else pops up. Perhaps these could be options in the export dialog of a future release (text encoding when selecting TXT as output, and a filename option that either gives you a mask (harvestxxx.txt) or simply outputs to 8.3).
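To make the filename idea a bit more concrete, here is a rough sketch in Python of the two options I have in mind: cleaning up an existing name so it is valid on Windows, or ignoring the name and applying a numbered mask. The helper names `pc_safe_name` and `masked_name`, the 64-character limit, and the "harvest" fallback are all my own assumptions for illustration, not anything DEVONagent provides.

```python
import re

# Characters Windows does not allow in file names, plus ASCII control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def pc_safe_name(name: str, max_stem: int = 64) -> str:
    """Replace invalid characters with '_' and cap the length of the stem."""
    cleaned = _INVALID.sub("_", name)
    stem, dot, ext = cleaned.rpartition(".")
    if not dot:
        # No extension present: treat the whole name as the stem.
        stem, ext = cleaned, ""
    # Windows also rejects names that end in a space or a dot.
    stem = stem[:max_stem].rstrip(" .") or "harvest"
    return f"{stem}.{ext}" if ext else stem


def masked_name(index: int, ext: str = "txt") -> str:
    """Build a name from the harvestxxx mask, e.g. harvest007.txt."""
    return f"harvest{index:03d}.{ext}"
```

The first keeps some hint of where the file came from, while the second is the safest but loses it entirely. Neither handles reserved device names such as CON or NUL, and neither produces true 8.3 names.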

Thanks again!

A.