New User with Scanning, OCR and general Q's

I apologize in advance for a long post with many questions. For me, the main expense is the investment of time in learning a new program and committing to it; DT Pro Office looks powerful, but it also seems to demand a real commitment to use well.

I bought my license based on a number of excellent reviews, because I am doing research which will likely make great use of the AI, and because the concept of the program appeals strongly to me.

I have just started using DT Pro Office 2, and have a number of questions before I devote a lot of time to the program. Many of my questions revolve around the utility of the program for my personal life, not just my professional life.

At present, I have created a folder hierarchy for all my personal documents, bills, correspondence, health records, computer manuals and programs, etc. I have been using a ScanSnap to convert all my papers to digital files, and am about 90% complete.

For my professional research, I have numerous pdf’s, with multiple Pages documents, Keynote presentations, graphics files, etc. Again, I have a folder hierarchy with most of this material in digital format. I use Papers to organize my pdf’s, and think it is great so far.

I have considered diving into tagging, probably via Leap.


  1. Integrity of the database

Is the database saved as a single file? If so, and if I save hundreds of bills and documents, a corrupted database file would cost me a lot of data rather than just a document or two. Mind you, I use Time Machine (backing up to a RAID NAS) and off-site online backup. Nonetheless, is this a concern? Should I keep several smaller databases? Any other comments?

  2. Scanning

I have not yet installed the optional OCR software that comes with DTPro Office. Is it better than the Abbyy software that came with the ScanSnap S510? Can I use this software and scan to my folder hierarchy at times, versus my DTPro? Will the ScanSnap manager provide flexibility in this regard?

  3. Originals vs. copies vs. aliases

I have read about this on a number of threads. In general, I understand that originals and copies will increase the size of the database, whereas aliases can be used for AI evaluation, but links could potentially be broken. If I am incorrect, please correct me. My main question in this regard concerns the data integrity and database file size.

  4. Do people OCR everything? Given the above, and the larger file sizes, is it necessary to have a searchable utility or credit card bill, rather than having these bills lumped together by type and year? In the latter case one could, in the unlikely event of needing an old telephone bill, e.g., just search for the document by month and year rather than by its contents. Also, given that I use Moneydance for my financial data, how do people here feel about storing bills as PDFs without OCR, since the contained data will be searchable in Moneydance? Has anyone any concerns about the size of their database file if they OCR everything?

I know I have asked a lot of questions! I will probably think of a few more as I learn to use this powerful program. Thanks in advance for any bits of advice for this newbie!

Since I do not scan a lot of documents I’ll only answer the first question: no, the database is not a single file, thank God. It may look like one in the Finder, but it’s actually a folder with subfolders and files inside. Each document in your database corresponds to a file on your hard drive (more or less; it’s a bit more complicated :wink:).

Wish there was an easy reference to the many previous discussions of that topic. The search results for database backup* might be fruitful.


I have read about Bill’s ‘Backup Archive’ script, which sounds great, but still cannot figure out if the database (or each database) is a package with all individual files, or a single file. I have seen different opinions on this forum. Any official confirmation?

Also, does anybody schedule a ‘Backup Archive’ action (so as not to forget it), or just remember to do so manually?

I have also read a lot on these forums about file sizes and OCR, with the improvements over time. I have not figured out if one can only scan to the program, or can scan and OCR for a file that will be used in a folder hierarchy in the Finder. Perhaps this isn’t necessary for someone who wants to use DTPro Office for all their documents. I was mostly looking at it as a tool for research, and have been happy so far with my folder hierarchy.

I am open to change, though, if anyone has gone through the same transition as I am considering!

I am impressed with all the activity on these forums. This is obviously a very active community, which bodes well for the growth of the program!

If you’ll settle for direct inspection over authority :slight_smile:, you can do the following: show a database in the Finder and tell it to “Show Package Contents”. You’ll find (amongst other things) a folder named “Files.noindex”. This contains folders named e.g. “pdf”, “rtf” and so on. In these there are other folders. In those folders are your files.

Hi and thanks!

I didn’t get the part about ‘inspection vs authority’…I must be missing something!

I believe you are confirming that the new database structure is not proprietary, and that corruption of a single file will NOT disrupt all the other files in the database?


Stupid joke, never mind. I do not represent Devontech, thus my explanations carry no official weight. However, doing what I said will show you that databases are stored as packages, which are actually a type of folder. The files you store in a Devonthink database can all be found under various subfolders inside the database package; thus they are in their original form. The folder structure you set up in Devonthink isn’t reproduced in there, though; you have to use the export script to get that.

So your files are roughly as safe/unsafe as anywhere else on your filesystem.

You’re the best! Thanks. That is exactly what I wanted to hear.

I thought I was the one with the stupid jokes…(not currently on display)

Just to be completely clear, the files are stored in a directory hierarchy that is quite cryptic. For example, under Files.noindex there are folders “pdf”, “rtf” etc.; inside those are folders “1”, “2”, “a” and so on, and the various pdf, rtf etc. files are scattered in there. So should DT refuse to open the database for whatever reason, you can still go in there manually (in the Finder) and find your files, but you’ll have to hunt around to find them.
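The hunting around can also be done from Terminal. Here is a minimal sketch; the database name and the file names are made up, and I’ve created a stand-in directory layout that mimics what “Show Package Contents” reveals, so you can see how `find` cuts through the cryptic intermediate folders:

```shell
# A stand-in for a real database package (names here are invented;
# the Files.noindex layout mimics what the package actually contains).
DB="$(mktemp -d)/Research.dtBase2"
mkdir -p "$DB/Files.noindex/pdf/1" "$DB/Files.noindex/rtf/a"
touch "$DB/Files.noindex/pdf/1/phone-bill-2009.pdf" \
      "$DB/Files.noindex/rtf/a/notes.rtf"

# Hunt for a file without caring which cryptic subfolder it landed in:
find "$DB/Files.noindex" -type f -iname "*bill*"
# -> .../Files.noindex/pdf/1/phone-bill-2009.pdf
```

With a real database you would point `DB` at the package itself (right-click it in the Finder to confirm its actual name and location) rather than building a mock one.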

DT seems quite robust; I haven’t had it refuse to open a database in more than a year of heavy use, and you can ask it to verify the database. And to judge from the small number of reports of such corruption happening in these forums, it seems I am not alone.

Mine tend to be incomprehensible to most, and not funny once comprehended, either… Oh well.

Since I last posted, I have been playing gingerly with DTPro Office. I have installed the OCR, installed direct ScanSnap Manager support, and tested scans to find a nice combination of image quality and file size (after reading many posts and experimenting. Thanks, Bill!).

I have been a bit hesitant to replace my current folder hierarchy…but am considerably more open now. So…a few more questions, and a repeat question or two:

  1. Have people replaced their entire folder/file hierarchy by re-creating it in DTP?

  2. If so, is it a simple matter to import all current folders/files and run them through the OCR while importing? Are these folder hierarchies maintained (which would enable easier classification with each new scan/import)?

  3. Originals vs. copies vs. aliases

I have read about this on a number of threads. In general, I understand that originals and copies will increase the size of the database, whereas aliases can be used for AI evaluation, but links could potentially be broken. If I am incorrect, please correct me. My main question in this regard concerns the data integrity and database file size.

The main reason is that I use Papers (an excellent program) to maintain my pdf database for medical research, and don’t want to duplicate all the pdf’s (for size reasons) if I don’t have to. So I would like to use Papers and pdfs in a folder outside of DTPro Office, but have the AI of DTPro Office to help me with my research. If it is better for me to import and copy all of these pdf’s, so be it!

  4. Does anybody schedule a ‘Backup Archive’ action (so as not to forget it), or just remember to do so manually? Is there a script for this to occur regularly?


I use DTPO to store all the documents that are meant to be read by myself (as opposed to by my computer). I keep my papers (physics), notes, travel information, receipts, certain emails (grouped with the appropriate papers, for example) etc in it. The only files I don’t keep in there are things like tex files and similar (and I would if I could do it conveniently).

So for me it has completely replaced keeping stuff in directories in my filesystem, mainly due to the searching and classification capabilities.

This I don’t know; I didn’t import the hierarchy the way it was before.

I tried using Papers for a couple of months too. It’s nice, but in the end I gave it up as I found it too limiting (I can’t remember exactly why, but probably because it was only useful for me if I remembered one of the authors; searching for keywords did not work as well, but I don’t remember why).

I was not very successful with indexing the papers database from DT because of a peculiarity of the physics preprint repository (arxiv) pdfs: they are garbled by OS X, so I’d need to ocr them to be able to search them (or download the tex and figure files and typeset them–this often turns up amusing comments in the source, which the authors presumably forgot to remove). Papers doesn’t recognize the change once you ocr, so its automatic online search and import (which is what I liked about it) wasn’t that useful in the end.

I just set a reminder for myself because I export it to an external drive, not always connected. Since this export is done by a script, though, you could simply schedule it using ical.
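For what it’s worth, on a Mac launchd can also run a script on a schedule without involving iCal. Below is a sketch of a user LaunchAgent (saved into ~/Library/LaunchAgents and loaded with `launchctl load`); the label and the script path are made up, so adjust both to match your setup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>local.backup-archive</string>
    <!-- Hypothetical path: point this at your own export script -->
    <key>ProgramArguments</key>
    <array>
        <string>/usr/bin/osascript</string>
        <string>/Users/me/Library/Scripts/Backup Archive.scpt</string>
    </array>
    <!-- Run nightly at 02:00 -->
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>2</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>
</dict>
</plist>
```

An iCal alarm that runs the script is the friendlier route; launchd is just the set-and-forget one.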

Yes, you could Import your entire folder/file hierarchy into a database, and the database would maintain that hierarchical structure in resulting groups/documents.

But I don’t do that, for several reasons:

  1. I often end up with files on my computer that I’m not very interested in – they have little or no long-term value. Why should I bother to import them into a database? I’ll probably end up deleting them anyway.

  2. I’ve got lots of files that are better managed in a special-purpose database such as iPhoto or iTunes. I’ve got some photos, movies and audio files in some of my databases, but only because they add information value for the purpose of that database – such as photos of cleanup stages at a hazardous waste site.

  3. There are some pretty large files in my Documents folder that would add no value at all to a database, such as MS Office User files and databases.

  4. I use topically designed databases that reflect a particular interest or need. That keeps my databases responsive in their RAM requirements, and also improves the focus of searches and See Also operations.

That means that I might split up the contents of a set of folders among more than one database.

No. If you select multiple PDFs using the command File > Import > Images (with OCR) you will end up with searchable PDFs in your database, but they will not have the organizational structure of your Finder folders.

Suggestion: Import your folders/files. Then examine the contents of each resulting group. The Kind of image-only PDFs is ‘PDF’. The Kind of searchable PDFs is ‘PDF+Text’. Add a Kind column to your view window for easy inspection and sorting.

Remember this: If you select all of the non-searchable PDFs within a group and choose Data > Convert > to Searchable PDF, they will remain in that group. If you have selected the option in Preferences > OCR to delete the originals, they will be deleted, and you will end up with only searchable PDFs in that group.

But if you select multiple PDFs from more than one group and use that command to convert them to searchable PDFs, they will all end up in one group.

An Index-captured document is NOT an alias of the external file. It does not behave like the alias of a file. The database has read and indexed the text content of the Indexed file, keeps the Path to the external file and refers to it for displays, although displays are not identical to those of the original file under its parent application, except for certain filetypes such as text, PDF, Postscript, HTML, WebArchive, etc. Many filetypes can only be represented via Apple’s Quick Look (if there’s a plugin for that filetype). If the Path to the external file is broken the database will lose important information about the Indexed file.

Import-captured files and folders are copied into the database and are stored in their native filetypes inside the database. The contents of the database can at any time be exported in their native filetypes and in the folder/file structure corresponding to the organization of groups and documents in the database. If one opens the database package file (Show Package Contents for DT Pro/Office databases) and looks inside the folder Files.noindex, all the individual files stored in the database can be found there.

So when one Imports a file into a DEVONthink database there are now two copies of that file.

I Import the majority of files that I capture from the Finder. After that, I have little or no interest in those files outside the database and either delete them or move them to external archive storage. I find the environment and facilities for using those files much richer in my database than their former environment in the Finder, Spotlight and their parent applications. So I work with them in my database.

In your place I would probably Index-capture the PDFs from the Papers database, as that allows both databases access to them.

I use Time Machine to keep backups of my computers (Time Capsule is similar). That automates the process. Time Machine is ‘free’ and large external drives are quite inexpensive.

But my primary backups are made using the Backup Archive script. This script was removed from public beta 6, but Christian recently posted it at viewtopic.php?f=3&t=8948&p=41541&hilit=+backup+archive#p41541

Clip the Backup Archive script to Script Editor (Applications > AppleScript > Script Editor) and Save it to /Home/Library/Application Support/DEVONthink Pro 2/Scripts/Export/.

There are many times when I don’t want to wait for a scheduled backup. If I’ve expended a lot of time and effort in making changes to a database, or have done a lot of writing inside it, I’ll invoke Backup Archive when I take a break. When I come back from the break the database has been verified and optimized, and I’ve got current internal and external backups. That external archive is the smallest possible compressed and dated complete archive of the database. I store it on a portable hard drive or a thumb drive. Note that the Apple Store sells a 1 terabyte external USB drive for about a hundred dollars. A 32 GB SDHC card that plugs into a USB card reader sells for about $30.

The reason I consider those Backup Archive files my primary backup is that I store them offsite, at my bank. If my hard drive fails, a burglar steals my computer equipment including my Time Machine backups, or my house burns down, I’ve got access to a recent collection of my most important databases. Those databases are at least as valuable to me as my mortgage-free property and my Acura RDX, combined.

I don’t use online storage of backups. I have satellite broadband with slow download speeds and REALLY slow upload speeds, and I often lose access to the Internet in bad weather. If I had fast and stable Internet access I might reconsider my offsite storage procedures. But I’ll confess to a lingering distrust of ‘cloud’ storage; I want control as much as possible.

Among all the things I’ve done in the past was taking over the job as quality assurance manager at a governmental agency to force everyone to develop and follow standard operating procedures. The agency had been losing too many enforcement cases because questions were raised about technical procedures. I was not loved for shaking up the agency’s procedures, but it worked.

All I will say, from my “belt and suspenders” QA pulpit, is that if you ever lose data for any reason, it’s your own fault! Don’t blame the software or the computer. It’s really pretty easy to protect your data by developing and following an effective (and simple) backup strategy.

I hope I’m not thread-jacking here, but what is the difference between the Backup Archive script and File>Export>Database Archive?


If you are thinking of originals/copies/aliases within your database, here’s my understanding (I’m not a computer geek, so someone please correct me if I’m wrong):

If you duplicate a file you end up with two independent files, each of which can be modified independently of the other, and each of which take up disk space: duplicate a 1MB file and you have two 1MB files, thus taking up 2MB of disk space.

Devonthink doesn’t have aliases as such; instead it has replicants, a rather more sophisticated method, though somewhat more baffling at first. As I understand it, it’s wrong to think of, say, ~/Documents/File.pdf as your file: in reality it’s just a reference to the actual file. (When you delete the file, what you actually do is remove the reference, telling the computer that that bit of disk space can now be overwritten if necessary.)

Thus, it’s possible to have more than one reference to one file. One way to think of this would be to imagine that you put a series of documents in a straight line in front of you. Now you need to create an index which tells you how to find each document. Essay1, for example, can be described as being document number five from the left. Equally, however, it can be described as being number ten from the right. Each reference will take you to the same document.

If, however, ‘document number ten from the right’ referred to an alias, what you would find would not be the actual document, but a note telling you to look for document number five from the left. If, for some reason, Essay1 is moved, the real references will be updated (e.g., now being doc. 6 and 9), while the alias will now be broken. If you delete all replicants (= references) you will delete the document. If, however, you have aliases, they will (usually) continue to exist even if you delete the original. Hence, replicants are much more useful than aliases.

Hence, if you replicate a 1MB file you still only have one 1MB file, taking up 1MB of disk space. This also means that if you edit one replicant you edit them all, just as if you highlighted some text on ‘document number five from the left’ you will also have highlighted some text on ‘document number ten from the right’, because it’s the same physical document. Of course, in the wonderful world of computing you’re not limited to two references to the document; for all practical purposes you can have as many references (i.e. replicants) as you wish.

If you photocopied Essay1, however, you could highlight Essay1 while leaving Essay1copy untouched, because they are two separate physical objects. They would also take up double the amount of space on your desk. That is essentially how duplicating files works.
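If you’re comfortable in Terminal, hard links give a rough filesystem analogue of this (only an analogue: replicants live inside the database, not on disk like this). Several directory entries point at one physical file, while a copy is a separate object; the file names here are just the essay example made concrete:

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

echo "essay text" > Essay1.txt
ln Essay1.txt Essay1-replicant.txt    # second reference, no extra data
cp Essay1.txt Essay1-copy.txt         # true duplicate, extra data

# "Highlighting" via one reference is visible through the other...
echo "highlighted" >> Essay1.txt
cat Essay1-replicant.txt              # shows both lines

# ...while the photocopy is untouched.
cat Essay1-copy.txt                   # still just "essay text"
```

Like replicants, the hard-linked names are equal peers: delete one and the data survives through the other; only deleting all of them deletes the file.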

Did all this make any sense? It took me a while before I really got it, so I’ve tried to explain it in a way that makes sense to me; if I somehow appear patronising that’s certainly not my intent. I hardly ever used aliases in the Finder - inevitably I ended up breaking them, - while I use replicants with abandon in Devonthink. My folder hierarchy within Devonthink is for the most part rather different from my Finder one - but to tell the truth I’m really just anxiously biding my time until tags appear :wink:

And then I read your post more closely and realise I’ve gone off on an irrelevant tangent:

C’est la vie, I guess…

(But I think Papers moves pdfs around if you update/change authors/journal, so I would be wary of broken links if you just index the files in Devonthink. It’s a while, however since I tested Papers, as it’s relatively useless for the humanities… If I were you I would have a look at the size of my Papers database, as recent scholarly pdfs tend to be rather small, as opposed to the gargantuan files - especially if OCR’d - of the early JSTOR days. I still wrestle with enormous, badly-scanned, un-OCR’d files from obscure journals, which means that duplicating everything is a no-no; if I only had to deal with pdfs from the last few years from major publishers, however, I could scatter copies of my database all over the hard disk without noticing.)


I want to thank you for your extra effort in explaining replicant vs alias to me. I have been going back through the forums as I am trying to make a serious effort in learning DTP, and happened upon some of my earlier posts.

While you did go off on a tangent, it was in NO way irrelevant, and as of today helps me more than I would have thought when I first read it.

I just recently responded to another poster’s question about using Papers, and re-reading your excellent response helps me a lot.

I would like to index my folders in Papers, which are organized by year, and will use the synch script to keep these up to date (which will obviate the potential issue you mentioned in your last post). I believe I can then create my own Groups to start the classification within DTP. In other words, while Papers ‘groups’ my papers by year, I believe I can use replicants to recreate a new hierarchy within DTP that will ‘tell’ DTP how certain papers address surgery, certain papers address diagnosis, history, etc…

That is why re-reading your post on replicants came at a perfect time!

If anybody else has done something similar, I would love to hear it, and if there are any details I am leaving out.