Multi-page TIFF / TIF handling?

sangye · August 12, 2023, 1:39am

We’re about to receive ~50,000 pages of multi-page TIFF and/or TIF files, and I’m a bit worried about processing them locally. Does DT3 play nice with multi-page TIFF? Will it OCR and PDF them?

BLUEFROG · August 12, 2023, 2:45am

Do you have a sample file to test?

PS: I hope you’re not thinking of queueing up 50,000 TIFF files to OCR.

sangye · August 12, 2023, 3:07am

Inquiring now-- thanks.

And yes, that is actually my plan—splitting the task across three Macs. I recently OCR’d about that many PDF pages in DT3… Started it in the morning and it was finished by the evening. But maybe processing TIFFs is more intensive? I don’t know what to expect, this is the first time I’ve had to deal with them.

BLUEFROG · August 12, 2023, 3:42am

Where are you getting these files, and for what purpose?

sangye · August 12, 2023, 3:48am

It’s an e-discovery dump coming from a state regulatory agency in the U.S. during a lawsuit. Evidently some of the more popular e-discovery platforms use TIFF.

BLUEFROG · August 12, 2023, 6:06am

Gotcha. I’m familiar with them from the printing industry.

tjur · August 12, 2023, 12:41pm

Hi,

had the same issue. Here is the solution:

Install Homebrew (https://brew.sh) by executing this in Terminal:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

(don’t forget the quotation mark at the end!)

Install ImageMagick in Terminal:

brew install imagemagick

Put all TIFF files in a folder (e.g. ~/Desktop/tiff-convert), then execute the following command in Terminal:

find ~/Desktop/tiff-convert -iname "*.tiff" -exec convert {} {}.pdf \;

(the command must end with “\;”!)

This converts all TIFF files in the folder to PDF files. Be aware, though, that in some cases (2-3%) it doesn’t work.

The PDF files should then be processed with OCR in DEVONthink.
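
For a batch of this size, it may help to know which files failed instead of guessing at the 2-3%. The following is only a sketch of the same idea, not a tested recipe for any particular dump: it also picks up the .tif extension, names the output file.pdf rather than file.tiff.pdf, and writes every file the converter rejects to a log. The function name convert_tiffs and the CONVERT_CMD variable are made up for illustration; the latter defaults to ImageMagick's convert and can be set to magick on ImageMagick 7 installs.

```shell
# Convert every .tif/.tiff below a folder to PDF and log the files that fail.
# CONVERT_CMD is a hypothetical override (a single command name or path);
# it defaults to ImageMagick's "convert".
convert_tiffs() {
  dir=$1
  log=$2
  : > "$log"    # start with an empty failure log
  find "$dir" -type f \( -iname '*.tif' -o -iname '*.tiff' \) -exec sh -c '
    log=$1; cmd=$2; shift 2
    for f in "$@"; do
      # "${f%.*}.pdf" swaps the extension instead of appending ".pdf"
      "$cmd" "$f" "${f%.*}.pdf" || printf "%s\n" "$f" >> "$log"
    done
  ' sh "$log" "${CONVERT_CMD:-convert}" {} +
}

# Example call (paths are placeholders):
#   convert_tiffs ~/Desktop/tiff-convert ~/Desktop/tiff-failed.log
```

Anything listed in the log can then be retried by hand or imported into DEVONthink directly. Note that a.tif and a.tiff in the same folder would both map to a.pdf with this naming.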

chrillek · August 12, 2023, 3:00pm

Does that mean that DT can’t OCR multi-page TIFFs?

BLUEFROG · August 12, 2023, 3:03pm

From my initial test, yes it can. However, we are trying to locate some good real-world example files to test.

chrillek · August 12, 2023, 3:30pm

@tjur: as @BLUEFROG said (and I could confirm), DT handles multi-page TIFFs just fine. I.e. it can OCR and convert them to PDF. So, which problem did ImageMagick solve in this case?

I found an ugly multi-page TIFF here:

Interestingly, it seems to use different compressions for the pages.
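
One way to check that per-page compression, assuming ImageMagick is installed, is its identify tool with the %p (page index) and %C (compression type) format escapes. This is a rough sketch rather than a definitive check; the function names and the IDENTIFY_CMD override are invented here for illustration.

```shell
# Print one line per TIFF page: page index and compression type.
# IDENTIFY_CMD is a hypothetical override; it defaults to ImageMagick's "identify".
page_compressions() {
  "${IDENTIFY_CMD:-identify}" -format '%p %C\n' "$1"
}

# Exit status 0 when the pages of a file do not all share one compression.
has_mixed_compression() {
  page_compressions "$1" | awk '!seen[$2]++ { n++ } END { exit !(n > 1) }'
}

# Example call (file name is a placeholder):
#   has_mixed_compression sample.tiff && echo "mixed compression"
```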

As to real-world samples: The USPTO used TIFFs in the past, and you can find a lot of those here:

http://storage.googleapis.com/patents/grant_multi_page_imgs_before2000

You’ll find a heap of ZIPs, each of them containing tons of files filed by USPTO No. The quality of the TIFFs might be terrible, though.

Some anglophone courts are perhaps also using TIFFs, but I couldn’t come up with a sample yet.

(Why on earth does anyone still use text-less TIFF instead of PDF for text?)

BLUEFROG · August 12, 2023, 8:47pm

Yeah - that’s the sample file I found too. Thanks for the link for extra TIFFs too.

tjur · August 13, 2023, 5:54am

Yes, that is the problem. I’m dealing with many TIFF files from the German business and commercial register. By my estimate, DT’s OCR function didn’t work with at least half of the files: DT only recognized and saved the first page of the TIFF. That’s why I use the “convert” command.
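
To spot-check whether pages go missing, it can help to know beforehand which TIFFs have more than one page and how many. A minimal sketch, assuming ImageMagick's identify is available (its %n escape reports the number of images in the file); the function names and the IDENTIFY_CMD override are hypothetical.

```shell
# Number of pages (frames) in a TIFF. identify prints %n once per frame,
# so only the first line is kept.
# IDENTIFY_CMD is a hypothetical override; it defaults to ImageMagick's "identify".
tiff_page_count() {
  "${IDENTIFY_CMD:-identify}" -format '%n\n' "$1" | head -n 1
}

# List each TIFF below a folder that has more than one page: "<count><TAB><path>".
list_multipage_tiffs() {
  find "$1" -type f \( -iname '*.tif' -o -iname '*.tiff' \) |
    while IFS= read -r f; do
      n=$(tiff_page_count "$f")
      if [ "${n:-0}" -gt 1 ]; then
        printf '%s\t%s\n' "$n" "$f"
      fi
    done
}

# Example call (path is a placeholder):
#   list_multipage_tiffs ~/Desktop/tiff-convert
```

The counts can then be compared against the page count of the resulting PDF in DEVONthink.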

chrillek · August 13, 2023, 6:59am

Bundesanzeiger? I was looking there, but only cursorily. Found only PDFs. Maybe you could pass one of the non-working TIFFs on to the DT developers?

tjur · August 13, 2023, 7:11am

here:
https://www.handelsregister.de/

I’ll send some files later where the error occurred.

tjur · August 13, 2023, 7:58am

I just checked some TIFF files again (ones where I think the error occurred last time) and no errors occur now. Could it be that one of the latest updates of macOS or DEVONthink fixed that issue?

tjur · August 13, 2023, 8:02am

However, thumbnails of the TIFF files are still not displayed…

chrillek · August 13, 2023, 8:12am

That’s not related to DT, I think. Thumbnails are handled by macOS, and if there’s no Quick Look plugin for TIFF, there’s no thumbnail.

tjur · August 13, 2023, 8:59am

OK, I had assumed that the TIFF files from the commercial register were corrupted and were therefore displayed in Finder without thumbnails. However, “Preview” can display them. Weird tactics by Apple not to deliver a native TIFF QL plugin…

I’ll get back to the error if it comes back to me.

cgrunenberg · August 14, 2023, 7:41am

The Thumbnails inspector supports only PDF documents currently.

rmschne · August 14, 2023, 7:51am

Curious: is the Thumbnails inspector software from DEVONthink or macOS, or a combination?