Concordance for PDF files

First, I want to say that I really love DTPO. Over the past few months I have used it more and more, and every now and then I am surprised by the features this app covers. It is now my most used app.

There is one thing I am struggling with, though, and that is the concordance for PDF files. For some of my files the concordance view shows very long words (the attached picture shows one of 49 characters). In the second picture you can see the corresponding text in the PDF file. It shows that the “word” actually consists of 11 different words. This is not a single occurrence, as you can see from the picture. It happens with a lot of PDF files.

I have come up with a workaround which needs some manual interaction but leads to better results overall for search and concordance view. Here is my process:

  1. I start with long ebook files in PDF format.
  2. I extract the content and index pages.
  3. I run a Python program to extract the useful words from the text (removing stop words and doing some other cleanup to improve quality).
  4. I copy the result of step 3 into the annotations part of the Annotations and Reminders inspector.
  5. Now I can use the annotation files for a single document or a group and work with the concordance.

The above steps take some time, but the benefit is high-quality data.
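A minimal sketch of what step 3 might look like, for anyone who wants to try the same approach. It is not the actual program: the tiny stop-word list, the `min_length` cutoff, and the function name `extract_useful_words` are all assumptions made for illustration, and the "other stuff" mentioned in step 3 is not covered.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list; a real run would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
}


def extract_useful_words(text, min_length=3):
    """Return the lowercase words of `text`, in order, minus stop words and short tokens."""
    # [^\W\d_]+ matches runs of letters only (no digits, underscores, or punctuation).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if len(t) >= min_length and t not in STOP_WORDS]


if __name__ == "__main__":
    sample = "The concordance is built from the text layer of the PDF."
    words = extract_useful_words(sample)
    # Space-separated text that could be pasted into an annotation.
    print(" ".join(words))
    # Quick frequency check of the cleaned words.
    print(Counter(words).most_common(3))
```

Keeping the words in their original order, with repeats, preserves the word frequencies, which is what the concordance counts.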

My questions are:

  • Is there a way to improve how DT handles PDF files, so that it does not concatenate words into a single long word?
  • Does anyone have ideas for improving my workflow above to create high-quality data for the concordance?

Thanks for reading my post. I am looking forward to your feedback!

[Image: Screen Shot 2021-10-09 at 12.40.34 PM]

Here is the second screenshot.

The Concordance displays words indexed from the text layer of the PDF. Use Data > Convert > to Plain Text on a PDF to see the contents of the text layer and check what you see there. I’m guessing it may be a product of your FireShot + OCR process.

Thanks for the reply. The above example does not include FireShot. It is an ebook bought from a publisher in PDF+Text format.

I did the conversion on this text as you described and the result shows the concatenated words. But the PDF+Text version shows the words separated by spaces. See attached image above.

My question now is: what does Data > Convert > to Plain Text do? Does it extract the text layer, or does it convert it from the document?

This issue happens with ebooks from different publishers. All of them are in PDF+Text format.

The display you see doesn’t matter to the underlying code. What you perceive as distinct words isn’t necessarily what’s in the text layer of the PDF, as is the case here.

Development would have to comment on the technical aspect of the plain text conversion.
@cgrunenberg ?

Thanks for coming back to me. Yes, you are right, the problem is with the text layer of the PDF and is not related to DT.

The conversion is done by PDFKit. Do you get the same results after, e.g., selecting the text in and copying it?

Yes, the result is the same. Text copied from results in one long word with no spaces.

Then it’s indeed an issue of PDFKit in the case of this document (probably due to its layout, fonts and/or internal structure). You could try to apply OCR to a copy; maybe this will provide better results.
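To judge whether an OCR'd copy really has a better text layer than the original, one option is to convert both to plain text and count the suspiciously long "words" in each. The following is only a rough sketch: the function name `find_suspicious_tokens` and the 25-character threshold are assumptions, and it expects the plain text to be passed in as a string.

```python
import re


def find_suspicious_tokens(text, max_length=25):
    """Return the distinct whitespace-free tokens longer than `max_length`, longest first."""
    # \S+ splits on whitespace only, so run-together words stay in one token.
    tokens = set(re.findall(r"\S+", text))
    long_tokens = [t for t in tokens if len(t) > max_length]
    # Sort by length (descending), then alphabetically for a stable order.
    return sorted(long_tokens, key=lambda t: (-len(t), t))


if __name__ == "__main__":
    sample = "short words here Thisisaverylongrunofwordswithoutanyspacesatall end"
    for token in find_suspicious_tokens(sample):
        print(len(token), token)
```

Running it on the plain text of the original and of the OCR'd copy gives a quick before/after comparison. Long URLs or hyphenated compounds will also be flagged, so the threshold may need adjusting.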