Concordance for PDF files

manfred9 · October 8, 2021, 11:59pm

First, I want to say that I really love DTPO. Over the past few months I have used it more and more, and every now and then I am surprised by the features this app covers. It is now my most-used app.

There is one thing I am struggling with, though: the concordance for PDF files. For some files, the concordance view shows very long words (see the attached picture, which shows one of 49 characters). The second picture shows the corresponding text in the PDF file, where that “word” actually consists of 11 different words. This is not a single occurrence, as you can see from the picture; it happens with a lot of PDF files.

I have come up with a workaround which needs some manual interaction but leads to better results overall for search and concordance view. Here is my process:

1. I am dealing with long ebook files in PDF format.
2. Extract the content and index pages.
3. Run a Python program to extract the useful words from the text (remove stop words and do some other cleanup to improve quality).
4. Copy the result of step three into the annotations part of the Annotations and Reminders inspector.
5. Use the annotation files for a single document or a group and work with the concordance.
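The Python step in the workflow above could look roughly like the following. This is only a minimal sketch, not the actual script: the function names, the tiny stop-word list, and the minimum word length are all assumptions made for illustration, and the "other stuff to improve quality" is not covered.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
              "for", "on", "with", "from"}

def useful_words(text, min_length=3):
    """Return lowercase words from text, dropping stop words and short tokens."""
    # Runs of letters only; digits, underscores and punctuation act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_length]

def word_counts(text):
    """Count how often each useful word occurs."""
    return Counter(useful_words(text))

sample = "The concordance of the PDF is built from the text layer of the PDF."
print(" ".join(useful_words(sample)))
```

The printed word list is what would be pasted into the annotation; `word_counts` gives a quick frequency check before doing so.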

The above steps take some time, but the benefit is high-quality data.

My questions are:

1. Is there a way to improve how DT handles PDF files, so that it does not concatenate words into a single long word?
2. Does anyone have ideas for improving my workflow above to create high-quality data for the concordance?

Thanks for reading my post. I am looking forward to your feedback!

[Screenshot: Screen Shot 2021-10-09 at 12.40.34 PM]

manfred9 · October 9, 2021, 1:01am

Here is the second screenshot.

BLUEFROG · October 9, 2021, 11:57am

The Concordance displays words indexed from the text layer of the PDF. Use Data > Convert > to Plain Text on a PDF to see the contents of the text layer, and note what you see there. I’m guessing it may be a product of your FireShot + OCR process.

manfred9 · October 9, 2021, 5:43pm

Thanks for the reply. The above example does not include FireShot. It is an ebook bought from a publisher in PDF+Text format.

I did the conversion on this text as you described and the result shows the concatenated words. But the PDF+Text version shows the words separated by spaces. See attached image above.

My question now is: what does Data > Convert > to Plain Text do? Is it extracting the text layer or converting it from the document?

This issue happens with ebooks from different publishers. All of them are in PDF+Text format.
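Since this affects several ebooks, one way to check a converted plain-text file for the problem is to flag unusually long tokens. The following is a minimal sketch; the function name and the 30-character threshold are assumptions, and legitimate long strings such as URLs would be flagged as well.

```python
def long_tokens(text, max_length=30):
    """Return whitespace-separated tokens longer than max_length.

    Tokens are returned in order of first appearance, without duplicates.
    """
    seen = set()
    result = []
    for token in text.split():
        if len(token) > max_length and token not in seen:
            seen.add(token)
            result.append(token)
    return result

sample = "Normal words here andthenseveralwordsruntogetherwithoutanyspacesatall end"
for token in long_tokens(sample):
    # Print the length next to each suspicious token.
    print(len(token), token)
```

Running this over the output of the plain-text conversion for each book would show which files have a text layer with concatenated words before they are added to the concordance.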

BLUEFROG · October 10, 2021, 3:19pm

The display you see doesn’t matter to the underlying code. What you perceive as distinct words isn’t necessarily what’s in the text layer of the PDF, as is the case here.

Development would have to comment on the technical aspect of the plain text conversion.
@cgrunenberg ?

manfred9 · October 10, 2021, 7:36pm

Thanks for coming back to me. Yes, you are right, the problem is with the text layer of the PDF and is not related to DT.

cgrunenberg · October 11, 2021, 7:21am

The conversion is done by PDFKit. Do you get the same results after, e.g., selecting the text in Preview.app and copying it?

manfred9 · October 11, 2021, 5:42pm

Yes, the result is the same. Text copied from Preview.app results in one long word with no spaces.

cgrunenberg · October 11, 2021, 6:55pm

Then it’s indeed an issue with PDFKit in the case of this document (probably due to its layout, fonts and/or internal structure). You could try applying OCR to a copy; maybe this will provide better results.