DT Pro Office Question

About the new OCR features:

Can it make scanned PDFs that are already in my database but have not been OCR’d searchable, or only those that I scan using the new version?

Cordially,
J.

I was able to do this using File > Import > Images (with OCR) in the menu.

We would like to allow that but there are a few kinks that need to be worked out.

Are those “kinks” resolvable? I had the same question as jfv. In research databases, many articles are simply PDF images of books and journals. The ability to OCR these without having to first print and then scan them would be valuable to most who do this kind of research.

Thank you for your efforts… I still plan to upgrade… it is just that this issue seems a natural one to address.

-bob

Bob, in the meantime – as almost always – there are kludges.

Whenever I come across an image-only PDF, I make a note of its location in the database, then drag it to the Finder.

In DTPO I select File > Import > Images (with OCR). Then I move the OCR’d PDF to the correct location and delete the image-only version.

I’m a little confused (not difficult to do)… in another post (devon-technologies.com/phpBB … highlight=) Bill says that existing PDFs can be processed by the OCR function of DTPO.

Annard says there are “kinks”, and here Bill refers to a “kludge”. I want to make sure we are on the same page…

Can I download, through DTP, DA, or Safari, an “image-only PDF” from an academic database and bring it into DTP using the OCR function, producing a new searchable PDF file? I’m assuming the original downloaded file remains intact in the Finder and can be trashed in favor of the new searchable one now located in DTPO.

Thanks in advance for the clarification.

-bob

The original question was whether an image-only PDF that had already been imported into a database could be OCR’d by DTPO’s File > Import > Images (With OCR) command.

The easiest way to do that is to drag the image-only PDFs to the Finder, select them and OCR/import them. You will then have two versions, one image-only and one PDF+Text, both stored in the database package’s Files folder.

If, on the other hand, you had Index-captured a PDF that’s external to the database package, you can run the OCR command from DTPO, selecting that file to process. You will then have two copies in your database: the original Index-captured PDF and a new PDF+Text PDF saved into the database package Files folder.

You will in either case probably wish to delete the image-only version of the PDF from your database.
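Before exporting files for OCR, it can help to know which PDFs in a Finder folder lack a text layer. The following is a rough heuristic sketch, not anything DTPO provides: it assumes that a PDF with a text layer references at least one font, so a file with no "/Font" entry is probably image-only. The function names are made up for illustration, and PDFs that compress their internal object dictionaries can be flagged incorrectly, so treat the result as a list of candidates to check by hand.

```python
from pathlib import Path


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Heuristic (an assumption, not a guarantee): text in a PDF is drawn
    with fonts, so a file that never mentions /Font likely has no text layer."""
    return b"/Font" not in pdf_bytes


def find_image_only_pdfs(folder: str) -> list[Path]:
    """Return the PDFs under `folder` that appear to have no text layer."""
    root = Path(folder).expanduser()
    return sorted(
        path
        for path in root.rglob("*.pdf")  # only matches a lowercase .pdf suffix
        if looks_image_only(path.read_bytes())
    )
```

Calling find_image_only_pdfs on the folder you dragged the PDFs into returns the candidates for File > Import > Images (with OCR).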

Thank you for summarizing the two issues… I am now clear.

This makes DTPO a very good choice even if your scanner need is not critical. Many PDFs and Inter-Library Loans are scanned as PDFs without the text layer included in the scans. Making these “searchable” is very helpful.

Academic users may or may not need a scanner and might pass on DTPO. Making it clear that “image-only” PDFs acquired from downloaded searches or Inter-Library Loans can also pass through the OCR operation into DTPO might trigger a few more upgrades.

Just a thought.

-bob

Bob,

I’m in the same boat as you: having several image-only PDFs downloaded from the University library.

I can confirm that DTO will read and convert these documents via the new OCR engine, but it’s pretty darn slow. So slow I didn’t think it was working initially. I’ve just done a formal timing test: it takes about 5 minutes for the IRIS screen box to appear as it scans each page. It took 10-11 minutes to scan a 20 page journal document.

Subsequent searching for text in the document seems to work fine.

Importing a document that has already been OCR’d, however, is quite fast.

Hopefully the process for image-only PDFs can speed up. At the worst I now know the homework I’ll assign to my computer while away for lunch!

-Mathew
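For planning a lunch-break batch, the timing above (10 to 11 minutes for 20 pages) works out to roughly 30 to 33 seconds per page. A small sketch of that arithmetic follows. It assumes OCR time scales linearly with page count, which the single measurement above does not prove, and the 120-page queue is a made-up figure.

```python
def estimate_ocr_minutes(pages: int, minutes_per_page: float) -> float:
    """Estimate total OCR time, assuming time grows linearly with pages."""
    return pages * minutes_per_page


# Observed in the post above: 10 to 11 minutes for a 20-page document.
low_rate = 10 / 20   # minutes per page, optimistic
high_rate = 11 / 20  # minutes per page, pessimistic

pages_in_queue = 120  # hypothetical backlog of image-only pages
low = estimate_ocr_minutes(pages_in_queue, low_rate)
high = estimate_ocr_minutes(pages_in_queue, high_rate)
print(f"{pages_in_queue} pages: roughly {low:.0f} to {high:.0f} minutes")
# prints "120 pages: roughly 60 to 66 minutes"
```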

Thanks for the feedback… almost makes printing and running through the scanner more time effective… a few trees take a hit though.

I wonder if the slow speed you are encountering is part of the beta shake down? Would memory or disk space be an issue?

-bob

Bob,

I don’t think RAM is an issue on my end (I have 1.25 GB). I also seem to have sufficient space on my hard drive, and I didn’t have other software open while DTO was running (aside from my email program).

I hope the lack of speed for image-only PDFs being converted is a beta issue and will naturally get sped up with the final 1.0 version.

In addition, the original 20-page PDF I had was less than 1 MB in size. The OCR version stored within DTO was 27 MB! I’m not quite sure why that’s happening. Intuitively I would think text-only docs would take up less space than image-only docs.

-Mathew

The issue of large files is being discussed. See here:
devon-technologies.com/phpBB … 8376#18376

A file growing by 27x is not going to work forever… Think of what that does to your remaining HD space after a few hundred of those!

I’m sure with a little time this will be addressed.

-bob
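To put a number on “a few hundred of those”, here is a back-of-the-envelope sketch. The 27x factor comes from the report above (an original under 1 MB becoming 27 MB); the document count and the average original size are hypothetical values chosen only for illustration.

```python
def projected_storage_mb(doc_count: int, avg_original_mb: float,
                         growth_factor: float) -> float:
    """Disk space needed after OCR if every file grows by growth_factor."""
    return doc_count * avg_original_mb * growth_factor


GROWTH = 27.0  # reported above: under 1 MB before OCR, 27 MB after
docs = 300     # hypothetical "few hundred" documents
avg_mb = 1.0   # hypothetical average size of an image-only original

before = docs * avg_mb
after = projected_storage_mb(docs, avg_mb, GROWTH)
print(f"before OCR: {before:.0f} MB, after OCR: {after / 1000:.1f} GB")
# prints "before OCR: 300 MB, after OCR: 8.1 GB"
```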

I downloaded DTPO to try. I was able to import a few PDFs from an Inter-Library Loan that were image only. I have a few questions about the size of the PDFs imported vs. PDFs when exported from DTPO.

  1. I imported an image-only PDF article from the Finder to DTPO. The PDF in the Finder is 4.5 MB. After import to the DB it is 95 KB in DTPO.

  2. When I “export” the scanned file from DTPO back to the Finder (I renamed it too) it is 13.7 MB.

Can you explain what DTPO is doing to make a file so small in the database but so large in the Finder? Is this newly created “giant” lurking in my database somewhere? I am concerned about DB size over time. Are the scanned files compressed inside the database, and do they only get larger with the export?

This is a great feature. The import was quick and painless. This 13 page document w/ multiple columns took under 2 min. to import.

Thank you very much.

-bob

Hi, Bob. No, the scanned and OCR’d PDFs don’t grow magically when exported or dragged to the Finder.

In the Info panel you will note two figures related to size. The number on the left is the information DT Pro retains in the database for the text, and the number to its right is the file size.

So, for example, I have an OCR’d PDF that has a Size of 146 KB and a File Size of 8645 KB.

But yes, the OCR’d PDF is considerably larger than the original image-only PDF. During OCR the image is rasterized again by the IRIS OCR engine. We cannot control this process at the moment but hope that IRIS can provide options for the size of the PDF file. Note that it’s also possible to reduce the image resolution in various ways, including OS X ColorSync settings (which I find to produce rather fuzzy results) or utility applications. Annard is looking into these issues.
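As a rough illustration of why image resolution drives file size: the pixel count of a page grows with the square of the dpi, so halving the resolution cuts the raw image data to a quarter. The sketch below assumes a US Letter page and 8-bit grayscale with no compression. Those are illustrative assumptions, not the settings the IRIS engine actually uses, and real PDFs compress their images, so actual files will be smaller.

```python
def raster_bytes(width_in: float, height_in: float, dpi: int,
                 bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of one page image at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page: US Letter (8.5 x 11 inches), 8-bit grayscale.
for dpi in (150, 300, 600):
    size = raster_bytes(8.5, 11, dpi, 8)
    print(f"{dpi} dpi: {size / 1_000_000:.1f} MB uncompressed")
# prints:
# 150 dpi: 2.1 MB uncompressed
# 300 dpi: 8.4 MB uncompressed
# 600 dpi: 33.7 MB uncompressed
```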

Thank you. I will trust that a solution will be developed for the increased file size. I like the new features and will upgrade tonight.

Bill, you have been using this for some time… I think I read in the forum that you scanned 2,000 pages so far… Has this slowed your DB? What strategies for DB size are you using since this OCR function was created?

I have yet to buy a scanner but have many image only PDFs downloaded during research… Hopefully a ScanSnap will be in my future.

-bob

Bob, the ‘ballooning’ of the OCR’d PDF file size doesn’t really make a significant difference to the database’s overhead for managing the searching and analysis of the document’s text.

In other words, DT Pro doesn’t care whether the image resolution is 72 dpi or 1200 dpi. The size of the text and metadata in your database is the same.

But of course the amount of disk storage space can make a difference, especially if you don’t have much of it. :)

So we would like to see IRIS provide options for reducing the file size of PDFs, or find other means available to users for doing that. Of course, there’s always a kludge: get a bigger hard drive. :)