Workflow question: home bills and research

I purchased a ScanSnap and am using the trial version of DEVONthink Pro. Though not technologically challenged, I am having difficulty figuring out how the program can achieve the following. I would like to be able to scan in home bills, correspondence, and letters (non-work-related documents). On the scanning end, I could batch-scan all the documents, but then I end up with one huge PDF file. Upon importing into DT, I now have a document composed of, let’s say, ten different bills. After performing OCR, DT could then assign this one PDF file to ten different groups (e.g., Utilities, Telephone, Insurance, etc.), so I would have the original plus nine replicates to cover all ten groups. Later on, if I wanted to search for a particular bill from a particular month, or if I wanted to see and/or print several like bills, I could search on, let’s say, ConEd (my utility company). But I would then need to go to, say, page 7 of one PDF, then page 3 of another PDF, and so on. The advantage of scanning all documents in one batch is obvious; however, having one PDF file containing so many disparate documents poses its own problems.

Of course, I could scan each document to its own PDF, thereby creating ten separate PDFs. It is more work up front, but in that case I wouldn’t even need DT, as I could simply rename each file with the name of the bill and the date of the bill.

How do people handle this issue (without scripts, as I do not know how to use them)? And to beg your indulgence, one more pressing question, relating to research:

I have one program for citations (EndNote). I am considering DT to be the repository of all my many electronic documents as well as notes. Does anyone know how to link references (in EndNote) to entries in DT? I would rather not re-enter reference citations into DT. Thank you!

I haven’t used EndNote for a few years, but I have an inkling that there’s some kind of permanent number for each reference that, dragged into another program, serves as a link to that reference. Apologies if I’m totally mistaken about that.

A less “live” or dynamic way to bring your citation manager and DT together is to export EndNote references in batches to BibTeX and then import them into a DT sheet. You could do that in batches of 10, 50, or 100, depending on how quickly your EndNote database grows. It means there’s usually a lag before your DT database becomes fully up to date, but in practice it’s no big deal.
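For anyone comfortable with a little scripting (the original poster asked to avoid scripts, so treat this as strictly optional), here is a minimal sketch of one way such a BibTeX export could be turned into a CSV file for import into DT as a sheet. The parser is deliberately rough, and the file names and column list are my own assumptions, not anything EndNote or DEVONthink prescribes.

```python
# Rough sketch: turn an EndNote BibTeX export into a CSV that can be imported
# into DEVONthink as a sheet. File names and field list are hypothetical.
import csv
import re

def parse_bibtex(text):
    """Very rough parser: one dict per entry; assumes simple, well-formed records."""
    entries = []
    for block in re.split(r"@\w+\s*{", text)[1:]:
        fields = dict(re.findall(r'(\w+)\s*=\s*[{"](.+?)[}"]\s*,?\s*\n', block))
        fields["citekey"] = block.split(",", 1)[0].strip()
        entries.append(fields)
    return entries

def write_sheet(entries, path, columns=("citekey", "author", "year", "title", "journal")):
    """Write the chosen columns to a CSV file, one row per reference."""
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(columns)
        for entry in entries:
            writer.writerow([entry.get(col, "") for col in columns])

if __name__ == "__main__":
    # "endnote_export.bib" is a hypothetical name for the exported file
    with open("endnote_export.bib", encoding="utf-8") as f:
        write_sheet(parse_bibtex(f.read()), "references.csv")
```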

A variation on this is to make full use of the EndNote Notes (or equivalent) field(s) before exporting. This reflects the idea that there’s a basic difference between notes on a specific paper (which even though you wrote them “belong” to that paper), and your own notes that refer to a number of papers. In this case you may find that a simple RTF export from EndNote and drag or import into DT (making sure you’ve chosen a format that includes those notes you’ve made) works just as well. By adding content that is highly meaningful to you to the raw materials that DT takes in, you make it easier for DT to process it in a meaningful way.

Taking this even further, you could use Smart Groups (or whatever EndNote calls that kind of function) to do a quick pre-sort of some of the references. This will give you some meaningful document groupings within DT. However, I would do this in addition to the unsorted groups to make sure nothing was left out.

I suspect that not all DT veterans would approve of this system, so you may want to wait for some more responses before going ahead.

And I recommend doing a forum search on EndNote, Bookends, and Sente to access previous discussions on similar topics.

The problem with this is that the citations, even after being imported into DT, won’t be linked to their associated PDFs that have been imported into DT. I don’t believe there’s an automatic way to make this happen. Secondly, once an EndNote library is imported via BibTeX into DT, if there is a change to the EndNote library, there’s no way to just “sync” it with the sheet in DT. (Please correct me if I’m wrong; I do hope I’m wrong.)

Other than the BibTeX-to-DT-sheet approach, I haven’t found many EndNote/DT synergy tips. I’m always hopeful there’ll be more, though, since these two apps are the ones I use for citations and notes.

Just a quick FYI that Bookends is 50% off today at macupdate.com/ for anyone interested.

As a newbie I have the same question as “brukal” (which no one has yet answered) about the division of batch scans into separate locations.
I have literally mounds of paper that I need to file digitally. Should I go ahead and file all like documents (from the mounds) in one batch? How do I then keep up with succeeding documents? Scan them singly? That defeats the purpose of having a batch scanner and puts me back where I am presently with my flatbed scanner.
I’m hoping that DEVONthink Pro will be the answer to getting rid of my constantly growing backlog but am still trying to work out how to set up the database.

Hi, RecordFanatic. There are several options concerning your questions about batch scans.

  1. In ScanSnap Manager > Settings > File option, note the Option button. Click on that, and you can choose to have all of the copy in the sheet feeder combined as a multi-page PDF, or you can set it to create a new PDF document for every 1 (or 2, and so on) pages scanned. If you are scanning a batch of one-page bills, for example, you can set the option to produce a PDF for each bill.

Note that there are other settings that may come in handy. Some of your bills may have text on both sides of the page. If so, set the scanner to do two-sided copy and uncheck the option to remove blank pages (ScanSnap Manager > Settings > Scanning - Option). Then set the option in Settings > File option to produce 2-page PDFs. I haven’t tried this, but I would guess that it would handle a batch of bills, some of which have copy on the back of the page and some of which do not.

  2. If you have done batch processing of mixed documents to produce a multi-page PDF, you can use a third-party application to split it. Adobe Acrobat offers the most extensive capabilities for splitting/merging PDF pages. PDFpen can be helpful and is much cheaper than Acrobat. Preview allows limited features such as moving a page to another PDF file, or deleting a page. And it’s possible to create an Automator workflow to merge pages. (A scripted alternative is sketched just after this list.)

  3. Personally, I feed complete documents, whether one-page or multi-page, into the ScanSnap sheet feeder one at a time, as scanning is pretty fast. In order to keep the workflow moving rapidly, I’ve unchecked the option to add document attributes in Preferences > OCR, and I’ve told ScanSnap Manager to create multi-page PDFs and remove blank pages. So I can go through a stack of bills quickly, although my computer will continue to process the queue of PDFs to be OCR’d for some time after I’ve finished feeding in paper copy. When OCR is finished, I’ll return to the database and name the newly added PDFs, usually by selecting an appropriate string (company name or whatever is appropriate) and invoking the contextual menu option Set Title As. I don’t spend much time on organization. I’ll probably just drag a collection of bills, for example, into a group named Bills 2008. If I’m tracking something more specific, such as a renovation project’s cost, I’ll create a subgroup for that purpose. And I rarely need to add a keyword, as Search can usually bring up all my renovation or electric bills very quickly. I usually add a date cue to the names of contracts, bills, etc. in this format: YYMMDD. So the tag for today’s date will be 080408.
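As mentioned in point 2, a big batch-scanned PDF can be split with a third-party application; for those who don’t mind a little scripting, here is a minimal sketch of the same idea using the third-party pypdf Python library (my suggestion, not something covered in this thread). The file names are hypothetical, and it assumes every page is a separate one-page bill, so any multi-page documents would still need to be regrouped by hand.

```python
# Minimal sketch: split a batch-scanned, multi-page PDF into one PDF per page,
# ready to be renamed and imported into DEVONthink. Uses the pypdf library.
from pypdf import PdfReader, PdfWriter

def split_pdf(source_path, prefix="bill"):
    reader = PdfReader(source_path)
    for index, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)               # one page per output file
        with open(f"{prefix}_{index:03d}.pdf", "wb") as out_file:
            writer.write(out_file)

if __name__ == "__main__":
    # "batch_scan.pdf" stands in for whatever ScanSnap produced
    split_pdf("batch_scan.pdf")
```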

Tip: I can do a Wildcards search for a string such as that date cue by using Tools > Search. The query encloses the string between asterisks, thus: *080408*, and the chosen Operator is Wildcards. Suppose I wish to list all my bills for the month of April, 2008. I simply do a Wildcards search for the query string *0804* and limit the search to my Bills 2008 group rather than the entire database.
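The same kind of match works outside the database too. As a purely illustrative aside (mine, not part of Bill’s workflow), here is how the *0804* pattern behaves against some hypothetical file names that follow the YYMMDD naming scheme, using Python’s fnmatch:

```python
# Illustrative only: the Wildcards pattern *0804* matched against hypothetical
# file names that carry a YYMMDD date cue, as in the naming scheme above.
from fnmatch import fnmatch

names = ["ConEd 080408.pdf", "Verizon 080315.pdf", "ConEd 080310.pdf"]
april_2008 = [name for name in names if fnmatch(name, "*0804*")]
print(april_2008)  # ['ConEd 080408.pdf']
```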

As I’ve made the titles of documents descriptive and each bill has a date in its Name, I can very quickly find what I’m looking for. Without having done a lot of subgroup organization a priori, I can quickly slice & dice my information by moving or replicating search results to a new group if I wish. So I can quickly create a group containing all my electric bills for 2008, if that’s useful. Or I can quickly file all the bills associated with a renovation contract, if that’s useful.

Unless I need to do that up front, I don’t bother with much detailed preliminary organization. That’s what I used to have to do when managing paper in a file cabinet – and if I made a filing mistake or my organizational design didn’t meet my future needs, I couldn’t find what I needed. DEVONthink lets me find what I need, and organize it differently whenever that’s useful.

Something to consider, especially for the household paperwork.

How often do you actually need to reference this stuff? Personally, once an item is noted in Quicken, I almost never need to see it again. I’d say that, at most, I reference old financial items a handful of times per year. Even my taxes can mostly be handled through Quicken reports, and those items that are tax-relevant can usually be identified and filed separately in advance.

For that reason, any scan-and-file approach to household bills, no matter how efficient, is unlikely to be worth the time.

Different situations and different kinds of documents may require different kinds of handling, of course. But you might think about what exactly your needs are before you embark on a massive (and time consuming) scanning project.

Katherine

Katherine, you are right to a point. I’ve got file boxes of bank statements, cancelled checks and bills going back for years. I hardly ever need to look at that old stuff. And I’m certainly not going to scan it.

But last year I sold a house. The buyer asked for the previous year’s utility bills (and so did I, when I bought my “new” log cabin). I rummaged around in a box and was able to put those bills together.

I don’t use Quicken any more. I hardly ever write checks these days. Routine bills and my banking accounts are handled electronically. I have electronic bank statements and electronic – and automatically paid – billing statements for my utilities, phone, wireless phone and ISP. So I keep those records in a database. Great. I can find stuff if I ever do need it. And all that stuff is invisibly small. No stacks of paper, no boxes of paper.

Even so, I keep getting paper coming in. Renovation project contracts. Warranties on things. Insurance policies. Almost every day something shows up in the mailbox that I probably should keep just in case. So if I don’t watch out, I’ve got stacks of paper building up.

So a few months ago I adopted a new policy. If paper comes into my house (except for books) that might have more than transitory importance, I scan/OCR it and toss the paper. If a piece of paper has personal information on it, I shred it – including all those darned junk mail offers of credit cards (which I don’t scan).

My subscriptions to scientific journals are online only. I don’t want printed copies. I used to have shelves full of printed weekly journals like Science. I gave them to a library that wanted them. I don’t even subscribe to a newspaper. Why should I? I can read them online, at least until they all go bankrupt. As for grocery ads, those keep coming into my mailbox anyway.

That doesn’t mean that I don’t have a lot of paper. I’ve got thousands of books. I love them and still have enough shelf space for lots more, so I probably won’t go the ebook route. And some paper is important in original form, like passports and birth certificates, or collectible prints and items like a 13th century manuscript or a page from Samuel Johnson’s original Dictionary.

I sold a house last year, too. When I did my taxes, I needed the closing sheet from when I originally purchased it, back in 1995. Thirteen years is a long time in computer terms. That would have been four or five computers ago, and possibly two or three changes of “standard” format. In electronic form, who knows whether I would have been able to retrieve it, but finding the paper was a matter of a little rummaging.

I too keep as much of my research material electronically as possible. I skim through paper journals that people insist on sending, but assume that the journal archives are available online if I ever need to reference something again.

But for long term personal records… I know I’ll still be able to read a sheet of paper after 13 years. The history of electronic records is much less reassuring.

Katherine

Katherine, you’ve emphasized something that’s really a very serious problem. As civilization has evolved, it’s become apparent that most of the ways devised to store information have become less permanent as the density of information storage has increased and as information has been made more widely available.

Records inscribed in stone, or in clay that is then baked, have survived for thousands of years. But even when the physical form of records is persistent over time, languages change and the meanings of inscribed symbols can be lost.

The development of writing on papyrus, animal skins, tree bark, cloth and paper made it possible to increase the density of information storage and decrease the cost of transmission of information. A few such records have survived for thousands of years. But most of the original records have been lost by accident and decay, and much of what has survived has been through generations of copying and recopying the originals, with consequent accumulation of clerical errors made by scribes.

I’ve got a parchment page from a 13th century Bible, beautifully penned and illustrated by a monk who was using the best technology of his time to transmit that data to others. That page is in pristine condition. I’ve also got a page from Samuel Johnson’s original dictionary of the English language. I bought those items at a museum sale back in 1954. Why were they for sale? Because only a few pages of the original book of which each was a part had survived; the original books were not restorable. The proceeds of the sale of those pages went to finance efforts to preserve and restore books, so that some could survive longer.

Paper is much cheaper than parchment made from animal skins. In the past, paper was much longer lasting (and more expensive) than most of the paper we use today, as it was made from linen fiber. I’ve got some books printed in the 18th century and the paper has survived well. But as demand for information increased cheaper paper technology was developed, and paper was produced from wood pulp. I’ve got books much more recent than those 18th century ones, with the paper literally crumbling. Librarians are warning that we are in danger of losing much of the written material from the 19th and 20th centuries for that reason. Large sums of money are being spent on preservation efforts, including digitizing such data before it becomes unreadable. (Early on, microfilming was viewed as a preservation method; now those old microfilms are crumbling, too.)

And so to digital data. I first became involved with digital information as project director of a university center to transfer environmental science and technology information resulting from federally funded research, back in the late 1960s. We received magnetic tapes that could be searched. Input for searches was done by punched cards. Output was printed on paper as a list of numbers corresponding to potentially relevant numbered literature abstracts, which were in paper form. I won’t go into how primitive and labor-intensive that system was, but for its time it was regarded as a technical marvel.

What about those old magnetic tapes, which were the primary vehicle for storing and distributing computerized information then? Many of them still exist, but are unreadable now, for two reasons. 1) Not only has time made the physical matrix of the tape unreliable, but the magnetic particles that stored the data have weakened so that read errors increase over time; 2) even if the tapes were still pristine, it would be difficult to find an existing read system that could interpret the data. Computer technology has evolved and the old equipment has been obsolete for decades.

I bought an Apple II+ with 64KB RAM and two floppy drives back in 1980. It came with several programs on cassette tapes, as well. I’ve still got a computer capable of reading those 140 KB floppy disks, although I haven’t fired it up for years. Those old floppies were pretty unreliable, although to my astonishment most of them were still readable about 5 or 6 years ago. But there’s a wall between the information stored on those old disks and my present-day Macs. My Macs can’t read the information on those old floppies. Yes, it’s technically possible (there are hobbyists who do this), but so involved that I’ve never gotten around to it. As a practical matter, my main access to that old information is through surviving paper printouts made at the time.

Then I moved to Macs with the “Classic” OS. I’ve got a lot of historical information created then that’s not accessible by my current Macs. My bridge to that information is through a couple of older Macs kept for that purpose. It’s a bit easier to transfer information from them to my OS X Macs, but still a chore. So there’s a wall between my past data and the data systems I use nowadays. And much of that Mac OS data is on floppy disks and Iomega disks, storage media that become unreliable over time.

Today, I rely on hard drives for primary storage of data, and keep some DVDs of important backups at an offsite location. I’ve got almost 2 terabytes of hard drive storage available. I’m managing more than 150,000 documents among a number of DT Pro databases.

Katherine’s point is really important. I’m on my fourth generation of computer use, from the old university CDC 6600 mainframe, to the Apple II+, to Mac OS and now Mac OS X. The further back in time, the less accessible that digitized information becomes to me in the present. That time frame is only over a few decades. How readable to someone else will my information collections be a few decades in the future?

One of the things I like about DEVONthink is that I can export everything to the Finder. But even then, most of the computers in the world today can’t read that information from my Macs. By 2050 I suspect that mechanical hard drives will be hobbyist curiosities and my DVD backup discs may have become unreliable, even if one possessed the equipment and software to read them and my hard drives.

The beauty of digitized information is that an enormous amount of information can be stored in a small “space” and can be copied over and over with far fewer errors of transcription than were made by scribes in the past, and at almost no cost. But that doesn’t mean that all digital information can be read by all digital devices – far from it. As a practical matter, the rapid evolution of computer technology and software has also meant the rapid obsolescence of previous digital storage and management technologies, as well as a Tower of Babel resulting from incompatible hardware and software technologies.

I’ve visited Egyptian temples thousands of years old, in which the accounts of gods and pharaohs are still visible as hieroglyphs inscribed in stone. How much information of today will still survive and be readable thousands of years into the future? As optimistic as I am, the answer could be that we will pass on less information than did the Egyptians.


Bill,

When you scan your bills, at what resolution do you set the ScanSnap?

Thanks for the tip,

Monica

A resolution of 300 dpi is recommended for good OCR accuracy.

I usually have ScanSnap Manager > Scanning set for B&W (black & white) at the Better setting.

If your copy has color content and you’ve set automatic color detection, please note that color scans are at half the resolution of black & white scans. Raising the Scanning setting to Best will allow 300 dpi for color (and 600 dpi for black & white pages). The Excellent setting is overkill; it takes a long time with generally no significant improvement in OCR accuracy.

I believe the exact opposite. I think our societal load of information is snowballing and will become so immense and unmanageable that what information we have will be practically useless.

For instance, I don’t really use the WWW all that much. I use Wikipedia about 85% of the time I’m online. Maybe 10% of the time I’m using three to five other sites – like this one, and MacRumors, and the little site I’m building (more on that later) – and the other 5% I’m using Google to find the answer to a question, generally a technical question like PHP syntax. So 5% of the time I’m online, I’d say I’m actually exploring the WWW – and really, I think I spend less than a half of one percent of my time actually finding a new site.

It’s an information overload. Most of the time, Wikipedia gives me about five or ten times the information I actually need. If I Googled the same subject, I might come up with millions of hits, most of which would be (at best) only tangentially connected to what I’m looking for (or spam/scam crap).

As PC storage gets larger, the temptation to keep everything will become stronger, and I think it’ll become a matter of course. Why delete those cached webpages? Simply add a new version. And everyone’s computer will become like a selective version of the Internet Archive, storing the Internet and their own documents and every movie or song they’ve ever downloaded, and so forth.

I suppose it’s ultimately a question of whether AI or self-discipline will evolve faster. AI could tag your songs properly, determine whether a webpage is useful or not, and determine whether you really need six copies of the same files taken from six different backup archives that you unarchived trying to find one file that doesn’t seem to exist anywhere. Or we could be disciplined (or anal-retentive, depending on your perspective) and maintain the things that are truly important to us – adding tags, deleting the cruft, summarizing the overly verbose.

I figure AI will come along sooner or later. But with all the crap we have to feed it, it’ll probably rebel. Perhaps that was the real story behind The Terminator. SkyNet got fed up with processing our caches.


I too have a ScanSnap and use it with DT, and the combination is amazing. I use four databases: personal, one for each business, and one for resumes I receive. Scanning is very fast, as others mentioned. I personally scanned bills by month until I got up to date, and now I scan them as I receive them. For bills I use two folders, Paid and Unpaid, and scan to Unpaid; once I pay a bill I move it. I haven’t bothered naming them, as the OCR works well. If it is a receipt of sorts then I name it, but typical utility bills are read fine by the OCR.
Also, I scan my returned cheques, and the OCR reads the cheque numbers fine, including the typed name if I perform a name search.
A great setup; I took my filing cabinet out of my office after about six months (it took a while to get up to date). I save some websites but found that to be sketchy at best: when I was price-searching for various cars, the URL would at times be broken. Now I just use Command+Shift+4 to copy the page to my desktop, drag it into DT, and assign a name to it.