Save OCR for indexing rather than import

I would like to see an option to save an OCR-ed PDF outside of the DEVONthink database, preferably along with an option to index it at the same time.

I prefer to have my documents filed separately and indexed into DT. To accomplish this now with documents in DTPO, I have to perform the OCR, import the document, view it in the Finder, delete it from DT, move it to where I really want to keep it, and index it back into DT. That’s a lot of steps!

I like having DTPO handle my OCR, but I wish it would let me keep my own file structure easily.

I don’t think we will implement this, but nothing is stopping you from using Automator to achieve your goal (note that I haven’t tested this):

[Set Current Group] - use a temporary group whose contents will be erased later
[OCR Items]
[Export Records] - don’t use temporary files and show the action so you can select the folder
[Get Specified Records] - drag the temporary group in here from DT
[Get Group Contents] - enable “Exclude input group records”
[Delete Records]

That should do the trick. You can test it and, if it works, save it as a Finder plugin.
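If Automator proves fiddly, roughly the same steps can be sketched in AppleScript. This is untested; the `ocr`, `export record`, `delete record` and `indicate` commands are assumptions based on the DEVONthink Pro Office scripting dictionary, so verify them in Script Editor before relying on it:

```applescript
-- Untested sketch: OCR a chosen file, export the result to a folder of your
-- choosing, remove the imported copy from the database, then index the
-- exported file back in. Command names are assumed from the DEVONthink Pro
-- Office scripting dictionary and may need adjusting for your version.
tell application "DEVONthink Pro"
	set theFile to choose file with prompt "File to OCR:"
	set theRecord to ocr file (POSIX path of theFile) to current group
	set theName to name of theRecord
	set theFolder to POSIX path of (choose folder with prompt "Keep the PDF in:")
	export record theRecord to theFolder
	delete record theRecord
	-- Index the exported PDF into the group you currently have selected.
	indicate theFolder & theName & ".pdf" to current group
end tell
```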

Thanks for the response; I never would have gotten anywhere near this with Automator.

However, it doesn’t work. I get an error “No items in input” at the OCR stage. Yet I have an item already in my temp folder.

At any rate, if I am reading it correctly, this workflow doesn’t quite do what I want, which is perform an OCR scan and allow me to index rather than import the results. If I understand the action description correctly, the OCR Items action doesn’t actually perform a new OCR scan; it merely subjects an already-existing PDF or image to OCR.

If there’s a workaround or fix, I’d like to know.

It occurred to me that the workflow might be looking for an item in the Finder rather than a DT window, so I ran it on a scanned file and got as far as exporting the file before “Workflow failed.” I don’t know how to determine what the failure was at that point; my Automator skills are minimal.

Two issues I’ve noticed so far: there’s no way to specify the exported filename (apart from changing the original scanned filename), and there’s an extra “DevonThink Storage” file left behind.

At any rate, I don’t see how the remaining section of the workflow will do what I want, which is index the OCR-ed file into the appropriate DT group. But as I say, I don’t know much about Automator.

Some comments about Index-captured databases and their organization:

I’ve often said in the forum that my databases result from Import of material captured from the Finder, rather than Index captures. The reason I usually mention for that is that I want my databases to be self-contained so that I can freely move them among different computers.

That’s true, but there are other reasons also.

I prefer setting up or changing organization of contents in the database itself, rather than in the Finder. Let me list some reasons:

DT’s replicants are logically more powerful than the Finder’s aliases of files. For one thing, during the process of tinkering with organization in a DT database I don’t have to worry about deleting the “wrong” file (in the Finder, the original) and thereby losing information. In DT, in a case where I have two instances (replicants) of a document, it doesn’t make any difference which one I delete; the remaining one is unaffected.

If I have Index-captured material from the Finder and subsequently change the organization of material in the Finder, it is possible to lose information content in the database and/or raise issues about synchronization of content after it has been modified in the Finder. Example: I move a database document out of a Finder folder that’s synchronized to a database group, and into a Finder folder that’s not synchronized. That item disappears from my database – I’ve lost information. Did I intend to do that? Maybe, maybe not. The possibility of human error rears its ugly head whenever I tinker with Finder organization of material that has been Index-captured to my database. I don’t like that.

DT provides tools for assisting database organization that are not available in the Finder. I often interact with DT’s artificial intelligence assistants including Auto-Group, Auto-Classify, See Also, See Selected Text, Words, Similar and Context to help me organize material. Even the simple Group and Ungroup commands, which do have counterpart routines in the Finder, are available in the database in a larger variety of view and sort environments. And I often replicate the results of a search into a new group.

Bottom line: If you are going to use Index captures, do essentially all organization in the Finder. (But I find that too limiting.)

The only files I Index-capture are Word .doc files (which I avoid whenever possible, which is mostly always). To make it easier to transport my databases among computers, I put my .doc files into a single folder (for that database) in my Documents folder. To move to another computer, I have to copy both the database and the associated folder containing its .doc files to the other computer’s Documents folder. Now I don’t have to worry (for the most part) about the logical pitfalls of Index-capture and synchronization. I can still easily edit the Word files, and individually synchronize a file after editing.

While I’m grateful for the advice on index vs. import, I don’t see anything in your list that can’t be done with indexed files, nor any reason I would have to do my organizing in the Finder rather than DT—all the features you cite are available for indexed files, as far as I can tell. I have rarely used DT’s AI functions, largely because I have rarely found them useful—maybe some day they will prove otherwise, but I believe they will work whether the files are imported or indexed.

As for moving DT files between computers, I never have any call to do that in the first place, and in the second place the DT file is already so big it would be impractical to move very often. Plus, corruption of this one file could result in the loss of all data. I prefer to have my data stored separately and easily available to Spotlight, Google Desktop, and other programs.

That being said, none of this speaks to the issue I’m having. I guess I’ll just have to do what I want done manually. Why is it that I always seem to spend more time trying to automate something than I would ever gain by automating it, especially since in the end nothing gets automated?

Text-type files such as rich and plain text, HTML and WebArchive files are stored in a monolithic database rather than as individual files in the Finder. So corruption of the database can make those files unrecoverable.

That’s not the case with PDF, PostScript, image and QuickTime media files, which are stored as individual files within the Files folder inside the database package file. In the case of database corruption those files will likely not be damaged or lost – unless your computer has serious directory problems, in which case all files are at risk. In the Finder you can look inside the database package file and discover your PDF files, which could easily be copied to another location.
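To illustrate that last point: because the database package is an ordinary folder bundle as far as Unix tools are concerned, `find` and `cp` can pull the PDFs out of it. The `.dtBase` extension and the `Files` subfolder layout are assumed from the description above; this sketch builds a throwaway mock package so it can be run harmlessly, but you would point `DB` at your real database instead:

```shell
# Build a mock package in a temp dir; point DB at your real .dtBase instead.
DB="$(mktemp -d)/Mock.dtBase"
mkdir -p "$DB/Files"
printf '%%PDF-1.4 dummy' > "$DB/Files/report.pdf"   # stand-in for a stored PDF

# Copy every PDF found inside the package to a safe location.
DEST="$(mktemp -d)/RecoveredPDFs"
mkdir -p "$DEST"
find "$DB/Files" -name '*.pdf' -exec cp {} "$DEST" \;
```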

The version 2.0 database structure will be substantially revised, and all files will be stored like PDF files are, in the Finder. And they will be visible to Spotlight.

Yes, you can do everything with Index-captured files that you can do with Import-captured files. But there really are logical pitfalls that can “bite” in the case of Finder and database reorganization of Index-captured files.

I’m currently managing more than 150,000 documents among a number of topically designed DT Pro databases. If I were to try to merge those into a single database it wouldn’t fit on my MacBook Pro with 100 GB hard drive (my Power Mac has 1.5 terabytes of online storage). Moreover, it would be slow and unresponsive, as it would force continual usage of Virtual Memory. I’m spoiled. I like most search queries to take less than 100 milliseconds.

I find that using topical databases works well. There are very few cases where I feel the need to duplicate any material in more than one database. And that rare need will go away in version 2.0. My main database provides a wide-ranging and comprehensive set of reference materials and notes (about 23,000) for my professional interests in environmental science, technology and policy matters. There’s no need to mix in my financial database which contains lots of detail about my financial accounts, taxes, etc. My email archive with about 25,000 messages constitutes still another database, and so on.

I try to keep my databases to a maximum size of no more than about 24 million total words, so that they are quick and responsive on my MacBook Pro with 2 GB RAM. (My Power Mac G5 dual core has 5 GB RAM, so can handle much larger databases without needing to use Virtual Memory.)

Another important advantage of topically-designed databases is that the artificial intelligence features become more focused and effective (and fast). That really helps with literature research.

My next Mac laptop will have 4 GB RAM, so I’ll be able to handle multiple open databases – still with fast performance – when that becomes possible. :-)

Example: I have another environmentally-related database that is about the same size as my main database, but deals with the details of chemical analytical methodologies, statistical data evaluation procedures, sampling design procedures and similar technical literature. The AI features work much better (for both databases) with the split of this material from the main database. But once in a while I do find it useful to switch from a question raised in my main database to the technical procedures involved, contained in the auxiliary database.

What I forgot to write is the dreaded (in my university days): “the rest is left as an exercise for the reader”. I just wrote up a quick skeleton to get you going.

Automator is actually quite straightforward compared to AppleScript but it will of course take some effort to master it. But in your case it might be worth it since it could save you some work in the end. Just imagine what you would have to do yourself to get the rest done and try to find those “actions” in Automator. One hint: after the Export action, you don’t want to use its output for the next action’s input if that one requires “records”. You can disable that.

I agree with you that automating something often seems to take more time than it saves, but that’s all relative to how often you’ll need the automation. And of course, as your skill set grows in creating automation processes, it will become easier over time. It’s up to you to decide whether it’s worth the effort.