PDF generated by Calibre not recognised as having text

Quatrel · May 18, 2013, 8:54pm

I am trying to see what I am doing wrong, but I am currently a bit stumped …

I have converted Kindle files to .mobi and PDF using Calibre. When I try to import or index these ebooks, the PDF files are listed as having no text in them and have to be converted with the DEVONthink OCR utility first.

Funny thing is, though, when I look at the PDF files with Acrobat, it can’t find anything at all to OCR and all text is selectable.

If I open the PDF in DEVONthink, I can also select text, without converting first.

So all these Pdf files are not indexed until I convert - which is a bit of a pain on 5’000 files.

Why is this happening? Is it because of the cover page perhaps?
I would appreciate some ideas on what else I can try … in the meantime it is converting those files - slowly - … lol …

Quatrel · May 19, 2013, 10:52am

Well, I can confirm that it is the cover of the Pdf that is the problem in getting DTP to recognise the file as having text in it.

It seems DTP only looks at the first page and if it does not contain text then it does not look at the following pages.
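To make the suspected behaviour concrete, here is a minimal Python sketch contrasting a first-page-only check with a check over every page. It only illustrates the hypothesis above and is not DEVONthink's actual logic; the function names and the list-of-strings page model are assumptions made for the example.

```python
from typing import List


def first_page_has_text(pages: List[str]) -> bool:
    """Hypothesised check: look at page 1 only."""
    return bool(pages) and bool(pages[0].strip())


def any_page_has_text(pages: List[str]) -> bool:
    """Thorough check: a single page with text is enough."""
    return any(page.strip() for page in pages)


# A Calibre-style PDF: image-only cover (no extractable text), then body text.
book = ["", "Chapter 1", "It was a dark and stormy night."]
print(first_page_has_text(book))  # False
print(any_page_has_text(book))    # True
```

If the first-page-only check is what happens, a text-free cover is enough to get the whole file classified as having no text, which matches what I am seeing.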

The conversion routine in Calibre does not seem to have a feature to skip the generation of a cover for the Pdf.

So right now, there are two workarounds:

Have DTP convert all Pdf files to text - really slow.
Select and open all PDFs in Acrobat, run text recognition on the first page only, and then save the file. The saved file is recognised as PDF+Text.

korm · May 19, 2013, 5:55pm

Delete the cover?

Do a batch OCR conversion in Acrobat (usually faster)?

Did you de-DRM the content in Calibre?

BLUEFROG · May 20, 2013, 12:56pm

You can print to PDF from Calibre using a range of “2 - n” (n being the last page but this needn’t be exact, just as high or higher than the last page available).
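For a large batch, the same "pages 2 to n" idea could be scripted instead of printed. The sketch below is a different technique from the print dialog: it assumes the third-party pypdf library is installed, and the function names are made up for the example. Whether the resulting files then import as PDF+Text has not been verified here.

```python
from typing import List


def body_page_indices(page_count: int, skip: int = 1) -> List[int]:
    """Zero-based indices of the pages kept after skipping the first `skip` pages."""
    return list(range(min(skip, page_count), page_count))


def drop_cover(src: str, dst: str, skip: int = 1) -> None:
    """Write a copy of `src` without its first `skip` pages (needs pypdf)."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in body_page_indices(len(reader.pages), skip):
        writer.add_page(reader.pages[index])
    writer.write(dst)
```

A call such as `drop_cover("book.pdf", "book-nocover.pdf")` (hypothetical file names) would write the coverless copy next to the original.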

You can also copy the front cover by right-clicking and choosing “Copy Image” in Calibre. Opening the PDF with no front cover in Preview, you can select File > New From Clipboard then drag the thumbnail from the newly created cover PDF to the coverless PDF and save it (if you really feel you have to have that cover).

Note that not all PDFs from Calibre will behave this way, i.e. fail to come in as PDF+Text.

Quatrel · May 20, 2013, 4:25pm

Thanks for the suggestions.

The files all had their DRM removed.

A funny thing was happening though - ⅔ of the files were recognised as PDF+Txt after doing text recognition on the first page in Acrobat. The other ⅓ would not show PDF+Txt even after doing OCR on the whole file in Acrobat, or doing the OCR conversion in DTP.

Checking this was taking too long, so I opted for the crowbar approach …

Because of this hit & miss situation, I just generated RTF files in Calibre and so avoided the issue, since I just need text and not layout. The pages with tables are a bit messy though.
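For anyone wanting to do the same on a large library, the RTF route can be batched with Calibre's `ebook-convert` command-line tool, which picks the output format from the file extension. This is only a rough sketch: the folder layout and function names are assumptions, and it presumes `ebook-convert` is on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def rtf_command(source: Path, out_dir: Path) -> List[str]:
    """Build the ebook-convert call that turns one ebook into an RTF file."""
    target = out_dir / (source.stem + ".rtf")
    return ["ebook-convert", str(source), str(target)]


def convert_folder(in_dir: Path, out_dir: Path, dry_run: bool = True) -> List[List[str]]:
    """Convert every .mobi in in_dir; with dry_run=True only return the commands."""
    commands = [rtf_command(p, out_dir) for p in sorted(in_dir.glob("*.mobi"))]
    if not dry_run:
        out_dir.mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stops at the first failed conversion
    return commands
```

Running it with `dry_run=True` first shows the commands without touching any files.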

It would be nice though if DTP could read .mobi or other ebook formats directly.