Questions about DTPro Office

danzac · March 22, 2007, 7:25pm

I have been thinking about the upgrade because of the OCR of DT Pro Office. I have a few questions about this.

Am I understanding correctly that applying OCR to PDFs creates an invisible text layer? So when I read the PDF I’m still looking at the image PDF, right?

Second, assuming the above is true, does the ability exist to highlight PDFs (i.e. the text layer)?

Third, sometimes you get PDFs (or pages that you are scanning in) that have two pages side by side in a landscape view. Does the OCR recognize the multiple pages on one PDF page? This potentially matters because it might read the first word on the facing page as the next word of the current line, which in turn might hamper a search.

Fourth, there are inevitably going to be misread words with the OCR. Once you have many OCR’d documents this can add up to a lot of nonsense words. Would this not affect the AI? And also, would this not populate the concordance feature? (P.S. Love the fuzzy word search.)

Fifth, OCR technology continues to be updated. How soon after OCR updates will they be incorporated into DTPro Office? And, if there were a major boost to the OCR, could the OCR be re-applied to files in the database (maybe automatically)?

It will probably be Bill DeVille who answers, so thanks, Bill.

milhouse · March 22, 2007, 7:38pm

After scanning, the PDF is still a PDF and looks the same (except it has a text layer). PDF highlighting is not available in DTPO.

I’ve never come across landscape and portrait pages in the same document while scanning. One would think the OCR engine would recognize the change in orientation and scan “pages” based on the page break or whatever element is standard.

I brought up the issue of mis-scanned words and AI issues. Yes, it can be somewhat of a problem (at least in theory) but most who replied to my query said it wasn’t a show stopper.

I do see the misspelled words in word lists and, depending on their frequency, they could (in theory) impact the concordance function as well as other AI functions.

As a note, I have a database with close to 2000 scanned PDF files. I have yet to see a significant AI-based mishap that would lead me to believe it was related to some small number of mis-scanned words.

YMMV

danzac · March 22, 2007, 8:14pm

Thanks for the quick reply. I know that PDF highlighting isn’t available in DT (although I think on the forum they have indicated that they are looking into it).

What I was wondering specifically was whether a highlight could theoretically be added to the text layer.

But I think I may just be misunderstanding the technology. When it is called an invisible layer, I imagine invisible text sitting ‘on top’ of the PDF image, so that if the text layer could be highlighted, it would look like the PDF was highlighted. But I think I must be misunderstanding the ‘invisible’ layer.

In any case, thanks for your help.

Danny