OCR speed and duplicates

I am not sure if I’m doing something wrong, but:

I added 2 small docs to the DT3 Inbox:

  • first is a screenshot (.jpg)
  • second a small document scanned by Scanner.pro (.pdf)
    to check OCR features in DT3.

For both docs it took ages (ok - a couple of minutes) until they were added to the inbox. In the meantime I saw the activity panels, but they didn’t progress.

The first time I tried, I cancelled the process, because I thought it wasn’t working. The second time I let it run for a couple of minutes, then it worked. But:

  • for the .pdf, DT3 added a duplicate ‘.pdf.pdf’ (as ‘PDF + Text’).

How can I avoid this error, and also have the OCR text added to the doc itself rather than getting a duplicate for OCR’d PDF documents?

Btw: PDFPen, which I normally use for OCR, took approx. 10 seconds for both docs and added the text to the .pdf (so no duplicate pdf).

Any hint?

Thanks
andy

The OCR would normally remove the previous file extension and add a .pdf when generating the PDF file. I will check why this happened.

By default a new PDF record will be created after the OCR. You can change this by opening the menu DEVONthink 3->Preferences and under the OCR section turn on the option “Original Document: Move to Trash”.

thx @aedwards

Meanwhile I found that setting and I also renewed my license to keep development going (although I am not at all happy with the new 2-computers restriction, as I already mentioned elsewhere).

The pdf problem is strange - I will try to create a reproducible case for you.

2 more things:

  • Do you have any idea why the OCR is that slow?
  • Is there a setting to OCR pdfs while importing (without triggering manually)?
    Reason: my tax-guys always send me tiffs (or jpgs if I’m lucky) embedded in pdf (it seems their scanner does it like this). I want these documents OCRed as well, but as is I have to trigger that manually…

Any ideas?

Whilst the ABBYY OCR is probably not the fastest, it is one of the most accurate OCR engines available. There will always be a balance between speed and accuracy and the OCR is configured for accuracy.

How are you currently importing the documents? Are the files in a folder or in an email?

Currently I am testing, so I add the files via drag and drop from the Finder…

If I drag the same file to PDFPen the OCR is instant (couple of seconds)

While trying to reproduce the duplicate ‘pdf’ suffix, I am seeing another strange behaviour:

For some reason, some of the files I added during my tests are missing the suffix in the iPhone app (latest version).

In the WebUI and Desktop app suffixes are shown properly…

Also: when trying to upload a file from the iOS app, all input fields have WHITE text (on a light-grey background). That’s most likely because my iPhone is set to the DARK theme (but still a bug in the iOS app).

Sorry to put everything in here - I’m currently on the go and very short on time, but still want to report it…

A document can have only one file extension (suffix), so when you see ‘.pdf’ twice it is because ‘.pdf’ has been appended to the document title. DEVONthink to Go doesn’t display the file extension for documents, so when you see ‘.pdf’, it’s because it is part of the document title, which can be edited to remove ‘.pdf’.
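
If several titles are affected, the cleanup could also be scripted on the file names before import. What follows is only a minimal sketch in Python, assuming plain file names as strings; the helper name `collapse_repeated_extension` is hypothetical and is not part of DEVONthink or DEVONthink to Go:

```python
from pathlib import PurePath


def collapse_repeated_extension(name: str) -> str:
    """Drop a trailing extension that merely repeats the one before it.

    Hypothetical helper for illustration; it only looks at the name string.
    """
    path = PurePath(name)
    suffixes = path.suffixes
    # Compare case-insensitively so 'Scan.PDF.pdf' is caught as well.
    while len(suffixes) >= 2 and suffixes[-1].lower() == suffixes[-2].lower():
        path = path.with_suffix("")  # removes only the last suffix
        suffixes = path.suffixes
    return path.name
```

With this, `collapse_repeated_extension("2019-11-10_Scan_.pdf.pdf")` returns `2019-11-10_Scan_.pdf`, while names whose extensions differ, such as `demo.tar.gz`, are left untouched.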

On the display issue, it’s not a bug in the DEVONthink to Go app, it just doesn’t have Dark Mode support yet. The devs have said that they are working on it and full Dark Mode support is coming.

Hi @Greg_Jones,

thanks for the clarification - strange though, because I don’t know how the ‘.pdf’ part made it into the dialog-title (but since I can’t repro atm it’s not important).

Is there a setting in the iOS.app to show suffixes? If not: how can I add a feature-request for this?

No setting that I’m aware of. Just speculating here, but I expect the fact that file extensions are not displayed is due to the limited screen space on iOS, and displaying the extension would add 4 more characters to the document name.

@eboehnisch would be the best person to take feature requests, and since I’ve tagged him he will see this.

To me file-extensions are important, because in my file/folder-structure there are often files grouped with the same name, but a different file-type (for different purposes):

  • demo.md
  • demo.doc
  • demo.pdf
  • demo.epub
  • demo.jpg

Since I’m often ‘on the go’ and need to access (or share) files from the iPhone, it is pretty important for me to see the file-types. As is, I can’t really import these files into DT3, because I can’t easily differentiate the files. So hopefully a setting can make it into the app.

You do get the file type info, if sorted into categories as you have pictured above. If you prefer to sort by a different classification, the file type is displayed under the document name (assuming one has not added tags). The document’s icon is also an indication of file type, which is what I usually go by as I normally sort by name and have tags assigned to my documents.

On iOS with its limitations to screen real estate and Apple’s clear tendency toward simplicity we decided to show titles without file name extension. The left list has a fixed width, like in all apps based on the split view, and so file name extensions would add more visual clutter than benefit to the majority of users (we believe).

I can see that they are important to you, @tiptronic, but we try not to overload DEVONthink To Go as an iOS app with a gazillion settings, switches, and options. iOS is based on simplicity, even iPadOS.

@eboehnisch I agree with your points to not overload the UI, but on the other hand, there’s a design principle called ‘form follows function’, which - in software design - means you should be able to (intuitively) use the software for its (desired) purpose.

Since DT is mainly about collecting documents, document-management features should ‘win’ over simplicity, because the main purpose of DT is ‘collecting and organizing documents’. The very nature of document-management software is that every user has their own strategies and preferences, so finding a ‘sweet spot’ (heuristically) for everybody is a tough task.

In this case, however, I don’t see the problem: The ‘i’ pane at the bottom already lets you change various settings (e.g. ‘show hidden objects’ - which in my understanding is much less important than showing a file’s suffix, which is a hint on the file’s purpose, whereas ‘hidden’ is just an unrelated viewing preference). So adding a line ‘Show file suffixes’ won’t clutter your UI at all, because you would add just one line to the list… and since the file-suffix is an important part of document management, there should be a way to make it visible.

@Greg_Jones I don’t agree with you at all:

  • The file-type in DT for iOS is only shown on some document types.
  • Some show the same type (e.g. .txt and .md).
  • Some show the same icon (e.g. ‘.png’, ‘.jpg’ and scanned ‘.pdf’).
  • Some show their type (but only pretty small and in a medium-grey font color, so they are really hard to read).
  • Some show their icon, but here’s a quick test for you (please don’t cheat :wink: ): What does an Excel icon look like? Or Numbers? How quickly can you differentiate a Word icon from a Pages icon?

And then: how about this:
FileOne.doc
FileOne.xls
FileOne.pages
FileOne.numbers
?

All that being said, just please add a setting to show file-suffixes - they simplify things a lot.

And here’s just one more example out of my daily life: I quite often search (in Finder on the Mac) for e.g. ‘Sketch files I changed one week ago’. For this very search I also use the file-suffix, because that’s what I want (no Sketch-plugins, -patterns, -libraries or whatever was created by Sketch, but only the design-files).

Sorry, this reply has gotten longer than expected, but I wanted to underline the importance (for me).

I agree with @tiptronic. Such a config option would be worthwhile for a lot of people, me included, who also combine the same base file name with different file type extensions.

Here’s one more issue I just now ran into:

I was looking for some recent scans. So I was searching for ‘scan’. Here’s what I got:

(The ones showing .pdf are those where a .pdf was added to the name while importing, for some unknown reason.)
Then I tried to extend the search to scan pdf, but there also, none of the pdfs were found, only the ones which carry pdf in their name.

It is really hard to say (if at all possible) what the file type is (except for those where DT added ‘.pdf’ to the filename, which resulted in 2019-11-10_Scan_.pdf.pdf inside DT3). But I want ALL scans (jpg, pdf, tiff, whatever…).

Creating a group All pdfs brings the desired result - but ONLY for pdfs.
So right now the situation is: If I want to find all scans, I can search for scan, but then I can’t differentiate those matches in the list, or I can search for scan in dedicated groups, which leaves out scans of a different type…
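
To make the wished-for result concrete, here is a rough Python sketch of "search for a term, then tell the matches apart by type". It works on plain file names only and assumes nothing about what DT3 or DEVONthink to Go actually expose; `group_by_extension` is a made-up helper and the sample names are invented:

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_extension(names, term):
    """Return {extension: [names]} for names whose stem contains `term`.

    Hypothetical helper: matching is case-insensitive and looks only at
    the part of the name before the last extension.
    """
    groups = defaultdict(list)
    for name in names:
        path = PurePath(name)
        if term.lower() in path.stem.lower():
            # Normalise the key so '.PDF' and '.pdf' land in one group.
            groups[path.suffix.lower() or "(none)"].append(name)
    return dict(groups)


# Invented sample names, for illustration only.
sample = ["2019-11-10_Scan.pdf", "Scan_receipt.jpg", "scan_tax.tiff",
          "notes.md", "Scan_old.PDF"]
matches = group_by_extension(sample, "scan")
```

Here `matches` holds the two PDFs under `.pdf`, and the JPEG and TIFF scans under `.jpg` and `.tiff`, while `notes.md` is left out because its name does not contain the search term.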

Btw: I just found that I have the same situation with my tax-statements: There are the raw exports from my accounts, which are .csv tables, the corresponding summaries are .xls tables and the account-overview is (again) .pdf. All carrying the same name (provided by my Mastercard bank).

So, basically there should be either an option to also filter by document kind or a less “human friendly” but more accurate display of the file type. Is that what it boils down to?

I’d still prefer the option to show the file-suffix - because in lots of cases that already avoids the need for a filter, because the file is already in view (e.g. Inbox).

But then, YES, filtering for file endings would be pretty helpful as well…

(Btw. I don’t agree with the ‘less human friendly’ phrase, because to me a written word is as human-friendly as an icon, if not more so. It might have been true 30 years ago, when there were only a handful of file endings and graphical UI was a big deal :wink: )