Dealing with Non-Searchable PDFs

mmoren10 · July 18, 2020, 4:48pm

Hello! I am using DEVONagent to search for PDF files of scanned documents that have not been OCRed.

My custom Search Set simply runs a Google search for “[query] site:[desiredSite]”. Google’s index seems to include some sort of OCR, so this step usually succeeds even for non-searchable PDFs, and the Google search returns results.

The problem comes next: when DEVONagent applies the secondary query to these PDFs, it finds no text at all and therefore discards them.

Is there a way to either:
a) OCR the Google search results before moving on to the secondary query (I have DEVONthink, but could I hand the files back to DEVONagent once they are OCRed?); or

b) Access Google’s OCR index to further refine my search within DEVONagent (currently, I can usually see only the paragraph containing Google’s search hit); or

c) Use any other approach that might fit these needs?

Thank you!

BLUEFROG · July 18, 2020, 5:35pm

No, these ideas are not feasible in DEVONagent at this time.

mmoren10 · July 23, 2020, 4:59pm

Would be great for a future release. Thank you!

BLUEFROG · July 23, 2020, 5:11pm

Bear in mind that these things are likely done via Google’s APIs, which may not be accessible, or may be accessible only on a licensed basis.