I want to start by saying that I really love DTPO. Over the past few months I have used it more and more, and every now and then I am surprised by the features this app covers. It is now my most-used app.
There is one thing I am struggling with, though, and that is the concordance for PDF files. For some files the concordance view shows very long words (see the first attached picture, which shows one of 49 characters). In the second picture you can see the corresponding text in the PDF file. There you can see that the “word” actually consists of 11 separate words. This is not a single occurrence, as you can see from the picture. It happens for a lot of PDF files.
I have come up with a workaround which needs some manual interaction but leads to better results overall for search and concordance view. Here is my process:
- I start with long ebook files in PDF format.
- I extract the content and index pages.
- I run a Python program to extract the useful words from the text (it removes stop words and does some other cleanup to improve quality).
- I copy the result of step three into the annotations part of the Annotations and Reminders inspector.
- Now I can use the annotation files for a single document or a group and work with the concordance.
The above steps take some time, but the benefit is high-quality data.
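To give an idea of what step three involves, here is a simplified sketch rather than the exact script. The function name `extract_useful_words`, the tiny stop word list, and the `min_length` and `max_length` limits are all assumptions made for this example. The upper limit is there to drop run-together "words" like the 49-character one described above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "it", "of", "on", "or", "the", "to"}


def extract_useful_words(text, min_length=3, max_length=30):
    """Count lowercase words, skipping stop words and tokens outside the length limits.

    min_length and max_length are assumed values for illustration only.
    """
    # Treat every run of letters as a token; punctuation, digits and
    # whitespace act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(
        token
        for token in tokens
        if min_length <= len(token) <= max_length and token not in STOP_WORDS
    )


if __name__ == "__main__":
    sample = "The concordance of the PDF lists words; the words are counted in the concordance."
    # Print each remaining word with its count, most frequent first.
    for word, count in extract_useful_words(sample).most_common():
        print(word, count)
```

The printed word list is what would then be pasted into the annotations field in step four.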
My questions are:
- Is there a way to improve how DT handles PDF files, so that it does not concatenate words into a single long word?
- Does anyone have ideas for improving my workflow above to create high-quality data for the concordance?
Thanks for reading my post. I am looking forward to your feedback!