A look into the future of information management?

OK, this subject may be a bit off-topic, but maybe someone here has similar questions, and somebody else perhaps a solution…

On a daily basis I receive newsletters, mostly as PDFs. These newsletters contain market data on countries, customers, etc. Each newsletter arrives as a non-text PDF of about 20 pages. The workflow is to copy these files (as they come in by email) into a specified Dropbox folder, which has a folder action attached that imports the document into DTP, converts the PDF into PDF+Text, and deletes the source file in Dropbox. That's automatic, simple, and works great.

But that only covers how to get the information INTO the database. What is more important (to my mind) is how to get selected pieces of information OUT again when I search for them. Of course, the search function does its job: it finds a lot of documents (some useful, others not). But going through all of those documents is a cumbersome task, and I wonder if there are alternatives.

What if there existed a program that scans through each OCRed PDF, figures out - by means of artificial intelligence based on formatting, font size, content (?), etc. - where each news article begins and ends, and copies each article to a new text-only file. Basically, one PDF would be split into a bunch of about 20 (txt or rtf) files, each dealing with a single topic, namely the headline of the article. The advantage: you would be directed to the desired information much more quickly, without having to scroll through 20 pages only to find a meaningless footnote on page 21 matching your search term…
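Just to make the idea more concrete, here is roughly how I imagine such a splitter could work, sketched in Python with the PyMuPDF library. Treat it as a thought experiment rather than a working tool: the 1.3× headline threshold, the file naming, and even the assumption that font size alone is enough to spot a headline are all guesses that would need tuning per newsletter.

```python
import os
import re
from collections import Counter

import fitz  # PyMuPDF: pip install pymupdf


def split_newsletter(pdf_path, out_dir="articles"):
    """Split an OCRed newsletter PDF into one .txt file per article,
    using font size as a crude headline detector."""
    os.makedirs(out_dir, exist_ok=True)
    doc = fitz.open(pdf_path)

    # Collect every text line together with its dominant font size.
    lines = []  # (font_size, text)
    for page in doc:
        for block in page.get_text("dict")["blocks"]:
            if block.get("type") != 0:  # skip image blocks
                continue
            for line in block["lines"]:
                text = "".join(s["text"] for s in line["spans"]).strip()
                if not text:
                    continue
                size = max(s["size"] for s in line["spans"])
                lines.append((size, text))

    # Assume the most common font size is the body text.
    body_size = Counter(round(size) for size, _ in lines).most_common(1)[0][0]

    # Heuristic (a pure guess): anything 30% larger than body text
    # starts a new article, with that line as its headline.
    articles, title, body = [], None, []
    for size, text in lines:
        if size >= body_size * 1.3:
            if title:
                articles.append((title, body))
            title, body = text, []
        elif title:
            body.append(text)
    if title:
        articles.append((title, body))

    # Write one plain-text file per article, named after its headline.
    for title, body in articles:
        fname = re.sub(r"[^\w\- ]", "_", title)[:60].strip() or "untitled"
        with open(os.path.join(out_dir, fname + ".txt"), "w", encoding="utf-8") as f:
            f.write(title + "\n\n" + "\n".join(body))

    return len(articles)
```

A real solution would of course have to cope with multi-line headlines, multi-column layouts, and recurring page headers and footers, but the basic "detect headline, cut, save" loop is what I have in mind.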

I know there are some serious scripting gurus out there in the wild (unfortunately, I am not one of them). Does anyone have an idea whether something like this is possible?