Indexing Items -- question

I have 20+ hard drives to index. It’s working fine so far, but I have two questions.

  1. After a drive is done being indexed into DT3 (i.e., 1000 out of 1000 files indexed and the progress dialog closes), DT then starts INDEXING ITEMS, and this takes FOREVER, sometimes overnight for a 2TB drive. My question: where is that indexing taking place? Can I unplug the drive and still have the internal indexing continue? It would be nice to be able to start on another drive while this one continues indexing.
  2. Time Machine thinks that I have 15TB of data on my 500GB MacBook drive due to the indexed files and will no longer work. So I have created a new database for the indexed drives, and I’ll store it on an external drive, I suppose. Is there a workaround for this? Even excluding the DT database file from the TM backup didn’t help; it still thinks I need more space on my TM drive.

Thanks
Paul

Why are you indexing 20+ hard drives?!?

An indexing database takes space. You could place one on each of the drives, I guess, i.e., when you create each database, save it on that individual drive. But 20 drives, wow. If the drives are 1TB or so, you could get good deals on 8TB drives today.

I am a film producer, and over the years we have produced more than 20 films. The 20 was just a random number; it’s probably closer to 40. Most are in RAID arrays with eight 3TB drives in each. I need a way to look for photos, footage, contracts, etc. without having to hook up a Mac Pro so I can run a RAID array and bla bla bla. This lets me say “oh, I have the music track for that on RAID 3, drive 5” and then go get it. Maybe indexing is not the best way.

That’s an unusual situation, but indexing would be the best option. However, indexing takes up space since it gathers the text and metadata for each file indexed. Indexing doesn’t make a simple link to a file.

Here is a database with 100,000 files, each containing only the line “This is file 000001 (000002, 000003, …)”.

Look at the size of the content…

And look at the size of the database file in the Finder…

Now consider your situation.

If all that metadata is stored in IPTC or EXIF metadata (or MP3/AAC tags for audio), maybe there’s a way to use a simpler query tool (such as exiftool) that scans the contents and pre-builds the data as a plain-text file, which is then easier to add to an indexed database?
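
For example, here’s a minimal sketch of that pre-extraction idea, assuming exiftool is installed; the mount point and output filename are hypothetical placeholders:

```python
#!/usr/bin/env python3
"""Sketch: dump the embedded metadata of every file on a drive
into a single plain-text file (assumes exiftool is installed)."""
import subprocess
from pathlib import Path

MOUNT_POINT = Path("/Volumes/RAID3-Drive5")      # hypothetical drive
OUTPUT_FILE = Path("RAID3-Drive5-metadata.txt")  # hypothetical output

# -r recurses into subfolders; exiftool's default output is already
# plain, searchable "Tag Name : Value" text, one block per file.
result = subprocess.run(
    ["exiftool", "-r", str(MOUNT_POINT)],
    capture_output=True,
    text=True,
)
OUTPUT_FILE.write_text(result.stdout)
print(f"Wrote metadata for {MOUNT_POINT} to {OUTPUT_FILE}")
```

Indexing one such text file per drive would keep the database tiny while still letting you answer “RAID 3, drive 5”-style questions.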

PS: Yes I know the movie metadata standards are a big mess.

Sorry, just to make sure I am seeing this right: so you index those documents, and they take up 2MB of space IN THE DATABASE, but somehow they take 100 times that space in the Finder?

That is correct.
The size of the content of the database is 2MB.
The database file contains the metadata, etc., about the files and can also contain internal metadata backups. And actually the overhead can be much higher than the number I cited (though I’m discussing some of it with Development).

Thumbnailing of these documents was enabled and this caused the overhead.

Indexing is usually done in two steps:

  1. DEVONthink scans the files & folders and adds items to its database referencing the originals.

  2. A background task optionally creates thumbnails, retrieves metadata, and indexes the text of the files. This task requires more time (especially in the case of PDFs, images, audio, video, and third-party file formats like Excel or iWork) and needs access to the originals. (A conceptual sketch of the two phases follows.)
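
To make the two phases concrete, here is a conceptual sketch in Python; this is not DEVONthink’s actual code, and the record structure and extraction steps are illustrative only:

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class IndexedItem:
    """A record referencing an original file; not a copy of it."""
    path: Path
    text: str = ""                       # filled in by the background pass
    metadata: dict = field(default_factory=dict)

def phase_one_scan(root: Path) -> list[IndexedItem]:
    """Fast pass: walk the folder tree and record references.
    This is the part that finishes when the progress dialog closes."""
    return [IndexedItem(path=p) for p in root.rglob("*") if p.is_file()]

def phase_two_enrich(items: list[IndexedItem]) -> None:
    """Slow background pass: open each original to extract text and
    metadata (and optionally build thumbnails). Because it reads the
    files themselves, the drive must stay connected until it is done."""
    for item in items:
        try:
            item.metadata["size"] = item.path.stat().st_size
            item.text = item.path.read_text(errors="ignore")[:10_000]
        except OSError:
            pass  # unreadable file: keep the bare reference

if __name__ == "__main__":
    items = phase_one_scan(Path("/Volumes/SomeDrive"))  # hypothetical mount
    phase_two_enrich(items)  # conceptually, the "INDEXING ITEMS" stage
```

That also answers question 1: since the second phase reads from the originals, the drive has to stay attached until it completes.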

As far as Time Machine goes, just go into Time Machine’s preferences and exclude the folder.
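
If you’d rather script the exclusion, macOS’s tmutil can do the same thing from the command line; a minimal sketch, with a hypothetical database path:

```python
import subprocess
from pathlib import Path

# Hypothetical location of the database package to exclude.
db_path = Path.home() / "Databases" / "IndexedDrives.dtBase2"

# `tmutil addexclusion` tells Time Machine to skip the item;
# `tmutil isexcluded` confirms the exclusion took effect.
subprocess.run(["tmutil", "addexclusion", str(db_path)], check=True)
subprocess.run(["tmutil", "isexcluded", str(db_path)], check=True)
```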

I’m building a reference database, primarily by indexing Mac folders. When I view the DT3 file packages on my Mac, the data file is about 600MB for an indexed folder of 7GB and 16,000 files. Since these are just links, I’m surprised the data file is that large. Just to verify, I exported that database, and the zip file is about 290MB. Are these file sizes what would be expected for indexed files? Maybe I’ve done something wrong; the files appear to carry the “index” icon. The database will be increasing to 200+GB with 60,000+ files. What size should I anticipate for the final database?




  1. You shouldn’t be messing about in the internals of a database.
  2. Yes, that seems reasonable. We just blogged about this topic…

It’s impossible to say accurately, but you could use the current values to estimate the final size if the remainder of the content is similar.
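
For instance, a back-of-the-envelope sketch using the numbers quoted above; both scalings are rough linear assumptions, not a promise about the final size:

```python
# Observed so far (from the post above).
db_size_mb = 600
indexed_gb = 7
indexed_files = 16_000

# Planned final corpus.
final_gb = 200
final_files = 60_000

# Two rough linear estimates; real growth depends on the content mix.
by_volume = db_size_mb * (final_gb / indexed_gb)       # ~17,143 MB
by_count = db_size_mb * (final_files / indexed_files)  # ~2,250 MB

print(f"Scaled by data volume: ~{by_volume / 1000:.1f} GB")  # ~17.1 GB
print(f"Scaled by file count:  ~{by_count / 1000:.1f} GB")   # ~2.3 GB
```

The wide gap between the two estimates is exactly why it’s impossible to say accurately: it depends on how closely the remaining files resemble the ones already indexed.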

Thanks, and I’ll check out the Database Sizes post. I understand your caution about messing with the internals of the database: one little click error is all it takes to toast the data. But I was trying to figure out why the DT3 file was nearly 2GB on 7GB of indexed data. I wasn’t thinking about the two backup files taking up so much space. If I’d thought about it, exporting to Database Archive and then extracting it would have been the safer way to figure out the file size.

Understood, and I’m glad you are aware of the potential pitfalls when looking inside the database package.

So yes, the thing you have to remember is that text has a “weight”. In text-heavy databases, the file size can seem larger since the searchable text is stored inside the database for faster retrieval and AI.
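
As a rough illustration of that weight, here is a tiny estimate; the average amount of extracted text per file is an assumption, not a measured value:

```python
# Assumption: average extracted, searchable text per file.
avg_text_kb_per_file = 30      # varies wildly by file type
file_count = 60_000            # the planned corpus from above

stored_text_gb = avg_text_kb_per_file * file_count / 1_000_000
print(f"~{stored_text_gb:.1f} GB of text stored in the database")  # ~1.8 GB
```

Even without copies of the files themselves, the extracted text alone adds up.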