This isn’t really a request or a bug report; in fact, I’m not sure where it belongs. Moderator, please move it wherever you see fit.
I’m working on a project that requires me to search for and harvest a collection of webpages about a particular topic. I love the flexibility that DevonAgent gives me in defining the search engines, and even in writing plugins of my own. My initial workflow was something like this:
Perform Search Using Selected Plugins
Add All to Archive
Remove Obvious Mismatches (results not related to topic)
Export Results As TXT and HTML
The project is for a client with a PC bias, and they were less than happy with my first harvest: the filenames of some results contained characters that are invalid on the PC, and the line breaks/carriage returns were UNIX style. Finally, the text encoding was Mac Roman. Hence I now post-process each result with a toolchain that goes something like this:
Use ‘A Better Finder Renamer’ to rename the files to HarvestXX.txt, where XX is incremented for each file. (I tried a few alternatives first, including the ‘Safe for SMB’ option, but some characters still got through, such as ‘?’ if I remember correctly, and some filenames were too long.) The problem is that I lose the original filename, which includes some detail about where the file came from, so I need to put this information into the file somehow. Shouldn’t the full URL for a search result be accessible somewhere, either inside the file or as a comment? It is essential that we can get back to the original source (it doesn’t matter if the page has changed since, e.g., the front page of CNN).
Next, run all the text files through a freeware application called TODOS.exe which converts the line breaks.
Finally, run an OS X command line:
find . -name '*.txt' -exec textutil -convert txt '{}' \;
to convert the text encoding to UTF-8. Actually, I’ll probably start doing these last two steps in the opposite order, so I can move from the TODOS.exe stage straight into archiving and sending instead of bringing the files back to the Mac side.
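The last two steps (line breaks and encoding) could also be done in a single pass on either platform. What follows is only a minimal sketch in Python, not something I have run against the real harvest: it assumes every exported file really is Mac Roman, and the function names `to_windows_text` and `convert_tree` are made up for illustration.

```python
from pathlib import Path


def to_windows_text(raw: bytes, source_encoding: str = "mac_roman") -> bytes:
    """Decode Mac Roman bytes and return UTF-8 bytes with CRLF line endings."""
    # Assumption: the input is Mac Roman. Other encodings will decode to garbage.
    text = raw.decode(source_encoding)
    # Normalise existing CRLF and bare CR to LF first, so nothing gets doubled.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\n", "\r\n").encode("utf-8")


def convert_tree(root: str) -> int:
    """Convert every .txt file under root in place; return the number converted."""
    count = 0
    for path in Path(root).rglob("*.txt"):
        path.write_bytes(to_windows_text(path.read_bytes()))
        count += 1
    return count
```

Calling `convert_tree(".")` from the export folder would rewrite the files in place, so it should only be run once, on a copy: a second run would decode the already-converted UTF-8 as Mac Roman and mangle any accented characters.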
While I’m still trying to sell this as an option, I’m coming under a lot of pressure to use a PC harvester. I’ve not been able to find any that match Agent. I ended up buying Copernic Agent Pro, only to find out that there is no way to save the results of searches to either one big file or individual files; their support people said it is something they might consider for a future version.
So, apart from waffling on, I guess the questions I have are:
- Have I missed anything in the toolchain? Are there any other potential pitfalls that I might hit when using these post-processed text files on a PC?
- How do I get access to the URL of the source document after I’ve exported? Can it be inserted into the file?
- Are there any other tools out there that might make my task easier for this PC-focused client? Note that this certainly won’t stop me using Agent for all my other harvesting tasks; it is an excellent program and a bargain at the price.
- Would I have fewer issues just providing the HTML versions of the source documents?
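To make the second question concrete, here is roughly what I mean by putting the source into the file. This is a hedged sketch in Python: the helper names, the `Original-Name:` and `Source-URL:` header lines, and the 60-character limit are all my own inventions, and it assumes the URL can be obtained as a plain string somehow, which is exactly the part I don’t know how to do.

```python
import re
from typing import Optional

# Characters Windows rejects in file names, plus ASCII control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_windows_name(original: str, index: int, max_stem: int = 60) -> str:
    """Return a Windows-safe .txt name that keeps part of the original title."""
    stem = original[:-4] if original.lower().endswith(".txt") else original
    stem = _INVALID.sub("_", stem)
    # Windows also dislikes names ending in a space or a dot.
    stem = stem.strip(" .")[:max_stem].rstrip(" .")
    if not stem:
        stem = "untitled"
    return f"Harvest{index:02d}_{stem}.txt"


def with_source_header(text: str, original_name: str, url: Optional[str] = None) -> str:
    """Prepend the original file name (and the URL, if known) to the text."""
    lines = [f"Original-Name: {original_name}"]
    if url:
        lines.append(f"Source-URL: {url}")
    return "\n".join(lines) + "\n\n" + text
```

For example, `safe_windows_name('CNN: Top "Stories"?.txt', 3)` gives `Harvest03_CNN_ Top _Stories__.txt`, so the numbered prefix keeps the names unique and short while part of the original title survives. The header would need to be added before any line-ending conversion so that its line breaks are converted too.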
Any help at all on this matter would be appreciated! Thanks for reading.