Unable to import large documents

I have updated to the latest version of DT (10.5.1, MacBook Pro) and now I find myself unable to import large .txt documents (it starts when they’re over 10-20 MB, I think). At first I thought it was my database, but when I created a brand-new one for testing, I had the exact same problem. Has anybody else experienced a similar problem?

I’ve got to ask. What do you have in text documents that are bigger than 10 to 20 megabytes?

Now for what may seem a silly question: do you have enough free disk space for the database to expand by that amount?
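If you’d rather check from a script than from the Finder, here’s a quick Python sketch (the path is a placeholder for wherever your database lives):

```python
import shutil

# Report free space on the volume that holds the database.
# The path below is a placeholder -- point it at your own database folder.
total, used, free = shutil.disk_usage("/Users/you/Databases")
print(f"free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
```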

They are dataset files (survey data). And yes, I have plenty of gigs available. So far, though, it only happens with text and Word files.

Anything logged to the system console (see /Applications/Utilities/Console.app)?

Nothing logged in the Console. When I try to import larger text files (I haven’t tried larger PDFs yet), DTPro does indeed slow down and import the files, but I end up with 0-byte files in the database with nothing in them. Oddly enough, DTP’s own log doesn’t record anything either.

So there seems to be action (in terms of the files being imported), but then I end up with empty entries. :(

That’s indeed strange. Could you send me a zipped example file? Then I could check this over here, thanks!

They are very large, even when zipped. You can give it a try by downloading one of the files in question. Select “Student questionnaire data file” from “Data sets in TXT formats (compressed) and documentation.” Is there a limit on the file sizes one can import into DTP?

Do these files load successfully in TextEdit? That’s a good indicator.

Yes - they load fine in all txt apps (even Word handles them well).

Yes, the limit for plain text files is around 200 MB, but the above example is 496.5 MB, and TextEdit needs 1 GB of memory to display it. I can’t even imagine why one would want to store this in DEVONthink.

The manual does imagine it - it says DTP can serve as a repository for projects and related files, keeping everything together. Having a “clean” copy of the original dataset available in a repository is a vital aspect of dataset management and creation. That way, I can keep all of my datasets together in one – portable – file. Or so I thought.

I guess I’ll have to find another solution then :(

You could of course add the zip to DEVONthink.

Don’t know why I didn’t think of that – thanks! Are there size restrictions for .zip files?

The only limitation is the available space.
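If you end up doing this for many datasets, the compression step is easy to script. Here’s a minimal Python sketch (the file names are placeholders, not the actual dataset names):

```python
import os
import zipfile

SRC = "survey_data.txt"   # placeholder: your large plain-text dataset
DEST = "survey_data.zip"  # the archive you then add to DEVONthink

# Plain-text survey data (fixed-width columns, repeated codes) compresses well.
with zipfile.ZipFile(DEST, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write(SRC)

print(f"{SRC}: {os.path.getsize(SRC) / 1e6:.1f} MB")
print(f"{DEST}: {os.path.getsize(DEST) / 1e6:.1f} MB")
```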

Yes, you can put all kinds of things into a DT Pro database. You could fill a Rolls Royce with wet cement to use for your driveway – but that doesn’t make sense.

Putting those data files inside a DT Pro database would impose an overhead on the database for building the Concordance, for instance. But that’s to no purpose, as the datasets are intended for analysis by SAS and SPSS, and that can’t be done in the DT Pro database, anyway.

I suspect that the text file limitation of 200 MB that Christian noted could be raised. But I can imagine no useful purpose for using DT Pro’s document management and information analysis tools on 200 MB documents, anyway. The text content of any individual letter, article, report or book rarely exceeds 1 MB and most “chunks” of textual information are much smaller than 1 MB.

I could download plain text files from Project Gutenberg of all of the writings of Charles Darwin together with secondary books about Darwin and his work. Then I could merge them all into a single text file, which I doubt would approach 200 MB in size. If I then wanted to analyze the development of Darwin’s ideas in his books and letters, and perhaps to consider how his contemporaries viewed those developments, I would have made DEVONthink Pro’s tools almost useless by putting everything together in one enormous chunk. A search for “survival” would yield one document as the result. A search for “Galapagos” would yield one document as the result. If I’m viewing the document and choose See Also I get one result, that unwieldy document itself. I’ve turned a Rolls Royce into a cement truck. I could work much more productively with a DT Pro database that contains the individual book and article files downloaded from Project Gutenberg.

Your datasets are not cement trucks. They are themselves databases that contain a great many smaller components that are subject to statistical analysis by the SAS and SPSS tools. Your statistical software “sees” the file not as a single large text file, but as a collection of thousands of individual records (just as my main DT Pro Office database file is actually a collection of tens of thousands of individual documents).

Want to store and transport your datasets? Put them in a folder where they are accessible to your SAS and SPSS tools. (The datasets would not be available to the statistical analysis tools were they stored in a DT Pro 1.x database, unless exported back to the Finder.) Now keep your notes about the data and your analyses in a DT Pro database, where they can be of assistance to you.

Leaving aside for a moment the question of whether importing such files into DTP is wise or useful, it does seem to me that the reported silent failure is rude. If DTP can’t handle a file for any reason – including size – it should log that fact so that the user can take appropriate action.

As for the files themselves, I would think that the program responsible for creating the dataset in the first place would be the best tool for data management, including maintenance of a clean copy. Surely you can’t be the only user of the package who has that need. If the issue is that you need to guarantee that the archive copy hasn’t changed, then you could use an inherently non-editable medium such as CD or DVD. If you use this kind of material often, you (or your organization) might even look at something like DSpace (dspace.org/), which is specifically designed for archival storage of large research datasets.

Katherine

I just tried and it won’t work either :(

I have a dedicated dataset DTP file that, for example, contains all the dataset handbooks (since they describe all the variables, questionnaires, etc.), properly set up in different folders by issue/source/etc., with replicates, so my “see also” searches for which other datasets might also fit the bill usually yield good responses. In other words, Bill_DeVille, my RR is running just fine.

The point of having the datasets available in a .zip file is simply to have it all in one place once I know which datasets I should also look at, and to have a back-up copy. I exclude those archives from the searches (“exclude from classification” - which I assume is a function for such cases). The big advantage is having all your stuff in one place (or rather one package), so I can use it in various places and easily take it with me (office, home, colleagues’ computers, etc.).

Some people might like to have files in different places on their computers. Others might like to have their critical data in one package. Linking to files has not worked so well for me, as file-organizing structures tend to change in my workflow, so having the zip files right there makes things a lot easier. So I am afraid it does make sense for me, even if that’s seemingly hard to fathom.

Christian’s suggestion of storing the zipped file rather than the uncompressed text file in your database is on point.

I just downloaded that file (INT_schi_2003.zip) and imported it into a database with no problem. A .zip file is an “unknown” file type, so it doesn’t get indexed, doesn’t add to the memory footprint of the database when it’s loaded, and doesn’t get copied into the internal Backup folders.

The creators of the file thoughtfully put the download URL into Spotlight Comments, so that this information shows up in the document’s Info panel Comment field.
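If you want to read that comment outside of DEVONthink, Spotlight’s metadata can be queried from a script. Here’s a small Python sketch using the stock mdls command-line tool (the file name is the download from above):

```python
import subprocess

# Query the Spotlight/Finder comment of the downloaded archive via mdls.
result = subprocess.run(
    ["mdls", "-name", "kMDItemFinderComment", "INT_schi_2003.zip"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # e.g. kMDItemFinderComment = "http://..."
```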

So yes, a zip-compressed version of a large text dataset can be stored in a DT Pro database along with the associated documentation, notes, etc., without causing indexing overhead, without significantly increasing the memory requirements for loading the database, and without “bulking up” the database package size by being included in each internal Backup folder through a series of database backup operations. That zip-compressed dataset file is actually stored in the Finder, in the Files folder inside the database package file. It is as safe as any other file on your computer, assuming a sound OS and hard drive. As always, an external backup on a safe medium is a good idea.