Duplicate Scanning with HP Officejet Pro 8500A

rlasanta · May 4, 2011, 1:28pm

I have been using the trial to finally decide whether I want to go with DT Pro Office. For a while now I have been getting duplicate files every time I scan. If it is a multiple-page document, I get the first page by itself in one file and the whole document in a second file, so I end up with two files per document I scan.

I have tried everything in the book and cannot come up with anything. I e-mailed the support team but got no answer. Maybe this is not the product I should buy…

Can anybody help me?

Bill_DeVille · May 4, 2011, 4:18pm

Your query to Support was logged on Wed, May 4, 2011, at 4:08am.

As you provided more information in the forum post, I’ll respond to it.

The general response when one reports duplicate copies of a PDF that has been OCRed assumes this scenario: An image-only PDF had been imported into the database, and subsequently it was selected and Data > Convert > to Searchable PDF was invoked. Depending on the setting in Preferences > OCR for the ‘Original document’ option, if that option is checked the original is moved to the Trash and only one copy, the searchable PDF, remains in the database. But if that option was not checked, both the original and OCRed PDF documents will remain in the database.

However, you said that you now have a multipage PDF and another PDF that contains only the first page of that document.

Two questions:

Is the multipage PDF searchable, or not? If its Kind is listed as ‘PDF+Text’, then it has been subjected to OCR. If its Kind is listed as ‘PDF’ then it remains an image-only, non-searchable PDF. Likewise, what is the Kind of the 1-page version of the PDF?

Note: You can add the Kind column to a view window by invoking View > Columns > Kind, or open the Info panel of a selected document (‘Shift-Command-I’).

There’s no information about the scanner model or the mode used to send output to DT Pro Office for OCR. Is it possible that the scanner setup was configured to save the output file to the ‘Inbox’ folder and also to send it to DT Pro Office for OCR? I suspect that may be what happened. It’s a bad idea to save scanner output to the ‘Inbox’ folder, as only the first page of a multipage document might be captured.

In the case of setting up a ScanSnap using ScanSnap Manager, the default location for saving files to a folder is the Pictures folder in the user account. That works well. If DT Pro Office Preferences > OCR is set to send the original PDF to the Trash after OCR, the original scanner output file will be removed from the Pictures folder after OCR. But if the scanner output had been sent to the Inbox folder, it can’t be removed, because it was immediately flushed to the Global Inbox.

rlasanta · May 4, 2011, 11:00pm

Thank you for your reply. To answer your first question: the file type is PDF+Text.

For question number two: my scanner is the HP Officejet Pro 8500A Plus. I am scanning using the auto feeder, and the setup is to scan to the Inbox and to delete the original file. I just tried what you suggested, scanning to the Desktop or Pictures folder and then importing into DT with OCR, and it seems to have worked. Is there a way I can set it up differently so I don’t have to do that second step, meaning that I could scan directly into DT?

Bill_DeVille · May 5, 2011, 12:39am

I don’t know whether your scanner can be controlled from DT Pro Office using the Image Capture or ExactScan Capture modes under the File > Import menu. Probably not.

But there’s another way you can set up for automatically sending the output of your scanner to DT Pro Office for OCR and storage of searchable PDFs.

Create a new Finder folder that is to receive your scanner’s output. For the sake of illustration, I’ll call that folder “Harry”.
In the Finder, Control-click on “Harry” and choose the contextual submenu “Services”, then select “Folder Actions Setup”.
Choose the script named “DEVONthink - Import, OCR & Delete”. You will find this script at ~/Library/Scripts/Folder Action Scripts/.
Attach that script to “Harry”.

Now operate your scanner using its provided driver software after configuring it to save scanner output to “Harry”. DT Pro Office should be running.

Each time a new scanner output file is saved into “Harry”, the attached Folder Action script will send it to DT Pro Office for OCR and storage of the resulting searchable PDF, then send the original image file to the Trash. The folder “Harry” will therefore be emptied as each image-only PDF is sent to it and then forwarded to DT Pro Office for Import and OCR.

Especially if the new file sent to “Harry” is a large one, nothing may seem to have happened for a time. To verify that the script has sent the image file to DT Pro Office for OCR, switch to DT Pro Office and choose in the menubar Window > OCR Activity. When processing is complete, the script will then delete the image file from “Harry”.
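
For anyone curious what that Folder Action amounts to, here is a minimal Python sketch of the same import-then-delete pattern. It is not the DEVONthink script itself (that one is AppleScript and talks to DT Pro Office directly). The names `drain` and `handler` are made up for illustration, and `handler` is only a stand-in for "send this file to DT Pro Office for OCR".

```python
import os


def drain(folder, handler):
    """Hand every regular, non-hidden file in `folder` to `handler`, then delete it.

    `handler` is a hypothetical stand-in for the import-and-OCR step.
    The real Folder Action moves the original to the Trash; this sketch
    simply removes it, which is an assumption made to keep things short.
    """
    handled = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if name.startswith(".") or not os.path.isfile(path):
            continue  # skip hidden files and subfolders
        handler(path)    # import first...
        os.remove(path)  # ...and delete only after the handler returned
        handled.append(name)
    return handled
```

Calling `drain("/path/to/Harry", print)` would just print each file's path and empty the folder. The point of the ordering is that a file is removed only after the import step has finished, which is why “Harry” stays populated while OCR is still running.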

adolphus · May 11, 2011, 4:36pm

I would just like to thank you for this very helpful suggestion. It helped me with a problem before I ever posted it.

I do have one follow-up. I have tried this method with an Officejet 6500, Mac OS X 10.6, and Image Capture, and it works great with one hitch.

When I use the document feeder, I get multiple copies of the same document: the first with the first page, the second with pages 1 and 2, the third with pages 1, 2, and 3, and so on. This is an annoyance at worst, because I can always just delete all but the last document. But it seems like a weird glitch. Is there any way to suppress it and get just one document at the end?

Bill_DeVille · May 12, 2011, 2:03am

Are you using the driver software supplied by HP to control the scanner and save its output to a designated folder?

My scanners, a Fujitsu ScanSnap and a CanoScan LiDE 500F, do not save PDF output (operating under their native software drivers) to disk until the multipage scan is complete, so they wouldn’t send multiple partial output files to “Harry”. I’ve never used an HP scanner, so you’ve got me puzzled.
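
If the scanner software really does write a growing series of partial files, one possible workaround is to scan into a staging folder and forward only the final file once the scanner has gone quiet. The Python sketch below is untested against any HP scanner and rests on two loud assumptions: only one document's files sit in the staging folder at a time, and the complete file is the largest one because it contains the most pages. The function names and the 30-second quiet period are hypothetical.

```python
import os
import time


def settled_files(folder, quiet_seconds=30.0, now=None):
    """Return the PDFs in `folder` once none has been modified for `quiet_seconds`.

    Returns an empty list while the scanner may still be writing.
    `now` can be passed in explicitly, which makes the check easy to test.
    """
    now = time.time() if now is None else now
    paths = [os.path.join(folder, name)
             for name in sorted(os.listdir(folder))
             if name.lower().endswith(".pdf")]
    if not paths:
        return []
    newest = max(os.path.getmtime(p) for p in paths)
    if now - newest < quiet_seconds:
        return []  # something changed recently; wait for the next check
    return paths


def split_final_and_partials(paths):
    """Split one scan's files into (final, partials).

    Assumption: the largest file is the complete document, since each
    later partial file holds one more page than the one before it.
    """
    final = max(paths, key=os.path.getsize)
    partials = [p for p in paths if p != final]
    return final, partials
```

The idea would be to move `final` into “Harry” and discard `partials`, so the Folder Action only ever sees the complete document.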

adolphus · May 12, 2011, 8:08pm

I did not initially use the software supplied by HP; I used Image Capture, which comes preinstalled with OS X.

I have had this printer/scanner for a few years and have bad memories of HP software. After I read your reply I went and re-downloaded the whole package and reinstalled it. The HP scanner software is worse than useless. It won’t save directly to a folder, which sort of defeats my purposes and your suggestion. I might as well just scan into Preview or Adobe Acrobat and save into “Harry.” Those programs have much more robust document manipulation tools.

I have been trying to import through Image Capture and DTPro, but it just doesn’t seem to play well with my scanner and certainly not the sheet feed. And at least half the time it spontaneously closes during scanning through this method.

So at this point I will type up my exact problems and symptoms and post under a different thread. I seem to have outgrown this one.