DT changes PDFs in a way that prevents some libraries from indexing them

vixxovs · December 25, 2022, 3:28pm

Hi
Since I’m an Obsidian user too, I was trying to index a few PDFs with this plugin: GitHub - scambier/obsidian-omnisearch: A search engine that "just works" for Obsidian. Includes OCR and PDF indexing. It uses the “minisearch” library.
I noticed that when I use PDFs exported from the DEVONthink database, the library fails to index them; the same PDF kept outside DEVONthink is indexed correctly.

The affected PDFs include ones that were never OCRed, ones OCRed in DT, and ones OCRed with other software, so the issue is not related to the nature of the PDF but to its having passed through DT.

How can I avoid this? I would like to keep my pdfs fully compatible even if I export them out of DT.

Thanks

chrillek · December 25, 2022, 3:46pm

I seriously doubt that. Can you open the PDFs in other apps? Does the software you’re using report any errors?

vixxovs · December 25, 2022, 3:49pm

I’m sure of this. I tested with multiple PDFs of different kinds, and it is easy to reproduce with Obsidian and that popular add-on.
Other apps don’t seem to suffer from this issue, but this library in particular can’t index the text inside the PDF.

chrillek · December 25, 2022, 4:57pm

And why wouldn’t the problem be with the library? Again: do you get any error message, and what exactly does “can’t index” mean?

vixxovs · December 25, 2022, 5:12pm

@chrillek I have correctly reported other DT bugs in the past, so this is not my first time here.
Before coming here I talked with the plugin developer (here: [BUG] can't index and find inside pdfs · Issue #163 · scambier/obsidian-omnisearch · GitHub); in that discussion you can find a report of the “error”. By “indexing” I mean the ability to read the text inside the PDF in order to search for specific words.
I didn’t think the problem was in the library, because the same PDFs don’t have this problem if they never pass through DT, so I was wondering what DT changes in the PDF structure.

chrillek · December 25, 2022, 5:49pm

Well, we’re talking about this issue. And after I read through the issue thread on GitHub, I still don’t see why that’s a DT problem. You said yourself that other apps can find and use the text layer in your PDF. So, it seems quite obvious to me that this library can’t find the text layer. Others can.

I’d suggest that you create a single, reproducible case where
– a PDF is indexable by this Obsidian plug-in outside of DT (where exactly is it residing, then?)
– the same PDF is imported into DT and then exported again under a different name into the same folder as the original one.
Now, run cksum -l original.pdf transited.pdf in Terminal – what is the result? If the results are different for the two files, DT did something to them. Otherwise, it didn’t.
In addition, you might want to run xattr -l original.pdf transited.pdf in Terminal as well.

Now check if the plug-in can index the second file, i.e. the one that you exported from DT.
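
For anyone who prefers a script to the Terminal commands, the same byte-for-byte check can be sketched in Python. This is only an illustration, assuming Python 3 is installed; the helper names sha256_of and same_bytes are made up for this example, and the file names are the ones used in the steps above. It uses SHA-256 rather than the checksum that cksum prints, so the values will not match cksum's output, but the conclusion is the same: different digests mean the file contents differ.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(first: Path, second: Path) -> bool:
    """True if both files have identical content (hypothetical helper)."""
    return sha256_of(first) == sha256_of(second)


if __name__ == "__main__":
    # File names taken from the steps above; adjust to your own paths.
    original = Path("original.pdf")
    transited = Path("transited.pdf")
    if original.exists() and transited.exists():
        print(original, sha256_of(original))
        print(transited, sha256_of(transited))
        print("identical" if same_bytes(original, transited) else "different")
    else:
        print("Put original.pdf and transited.pdf in the current folder first.")
```

Note that this only compares file contents. Extended attributes, which xattr lists, are stored outside the file data and are not covered by the digest.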

vixxovs · December 26, 2022, 10:11am

Tested with a few PDFs; here’s an example output of the checksum:

chrillek · December 26, 2022, 10:43am

Indeed. Weirdly enough, drag&drop does not change the file. I.e., if you drag the file from DT to its new destination, it should remain the same.

I’ll open a new thread focusing on this difference.