Huge Kindle searchable files

lacomaco · September 5, 2015, 9:18am

Hi,

I converted more than 700 of my Kindle books to PDF with Calibre and imported them into a DT database. When I tried to search their text I found that it is not possible, so based on some advice in the forum I started to “convert them into searchable pdfs”. I then found that the size of each file increased from 1 MB to 500-600 MB.

So having all 700 Kindle books as searchable PDFs would take a minimum of 350 GB…
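
For anyone checking the arithmetic, here is a rough sketch of the estimate. The function name `estimate_total_gb` is made up for illustration, and it assumes decimal units (1 GB = 1000 MB) and that every book ends up in the 500-600 MB range quoted above.

```python
def estimate_total_gb(book_count: int, mb_per_book: float) -> float:
    """Return the estimated total size in GB, assuming 1 GB = 1000 MB."""
    return book_count * mb_per_book / 1000


# 700 books at the low and high end of the reported per-file size.
low = estimate_total_gb(700, 500)
high = estimate_total_gb(700, 600)
print(f"{low:.0f} GB to {high:.0f} GB")
```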

Is there a way to decrease their size?

Thanks in advance
Laszlo

korm · September 5, 2015, 8:00pm

Some applications such as Preview or Acrobat have utilities that can compress and optimize PDFs, reducing their size. However, 350 GB or even half or a third of that is extraordinary for 700 books. Something else is wrong. Are you sure Calibre is working right? Have you asked them why their PDF conversion from one text format (MOBI) to another is not searchable? Are these 700 picture books?

This seems like a failure in the workflow upstream from DEVONthink.

chdevonthink · September 6, 2015, 1:55am

I recently found out that DTPO’s built-in searchable PDF creation tool is not very good; it creates rather large PDF files. Luckily, I have access to Adobe Acrobat and can use that to create searchable PDFs. In one case, I used DTPO to create a searchable PDF and the file ballooned from 950 KB to about 12 MB. Doing the same with Acrobat produced a file of only about 1 MB to 1.25 MB.

The only disadvantage with using Acrobat is that I cannot control-click on the file within DTPO to create the searchable PDF; it takes a few more steps.

lacomaco · September 6, 2015, 3:42am

Korm and Chdevonthink,

I think Calibre worked fine, as the size of the non-searchable pdf was only 1-2 MB, so probably DT does something wrong.

Thanks for your help, I will try Adobe Acrobat.

Allsop · September 6, 2015, 5:02am

This is most interesting, and it made me question why my default practice when saving documents was to save them as searchable PDFs. The answer in most cases is habit, a left-over from when I needed to share a lot of stuff with other people and PDF was a requirement. Now that this is not so much the case, Rich Text would be just as good for me, and I could easily convert any document to PDF later if I needed to.

This epiphany naturally leads to the question: is there a way to easily convert my searchable PDFs, many of which contain text and pictures, to RTFs? It is, of course, the pictures that create the problem, as the menu item “convert to rich text” does not include them in the converted document.

lacomaco · September 6, 2015, 5:54am

Or a way to convert a non-searchable pdf to RTF… which would also solve my problem.

Mio · September 6, 2015, 10:29am

The way I decrease the file size after converting a file to a searchable PDF via the Calibre app is to convert that file to an RTF file, or to a Microsoft Word file if it has pictures I need. Once it is converted to a Word file, I convert it back to a PDF file and voila, the file size decreases significantly.
Unfortunately, you need the Adobe Acrobat Pro app for the above procedure.

korm · September 6, 2015, 10:32am

@Allsop – what is gained?

There are a few PDF-RTFD converters in the Mac App Store. Acrobat or PDF Pen Pro can convert PDF to .docx.

korm · September 6, 2015, 10:41am

Acrobat is not necessary. PDFs can be resized for free.

If you open a PDF in Preview and Export it you can choose the “Reduce File Size” Quartz filter. (Use “Export” not “Export as PDF…” in Preview.) This reduces the size to between a half and a third of the original, in my experience.

To do this with a batch of files, use the “Reduce File Size” filter in the “Apply Quartz Filter to PDF Documents” action in Automator. Make sure your Automator workflow makes a backup safety copy so that the process doesn’t trash the original. Delete the backup after you’ve examined the result.
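
If you script a batch outside Automator, the same "back up first, then shrink, then compare" pattern could be sketched roughly like this. This is only an illustration: `shrink_with_backup` and the `shrink` callable are hypothetical names, and the size-reduction step itself (the Quartz filter or any other tool) is not implemented here; you would pass in whatever function performs it on your system.

```python
import shutil
from pathlib import Path
from typing import Callable


def shrink_with_backup(
    folder: Path,
    backup_dir: Path,
    shrink: Callable[[Path], None],
) -> dict[str, tuple[int, int]]:
    """Back up every PDF in `folder`, run `shrink` on it, and report sizes.

    Returns a mapping of file name to (bytes before, bytes after).
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    report = {}
    for pdf in sorted(folder.glob("*.pdf")):
        before = pdf.stat().st_size
        # Make the safety copy before anything touches the original.
        shutil.copy2(pdf, backup_dir / pdf.name)
        # Hypothetical in-place reduction step supplied by the caller.
        shrink(pdf)
        report[pdf.name] = (before, pdf.stat().st_size)
    return report
```

The returned report makes it easy to check the results before deleting the backup folder.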

Allsop · September 6, 2015, 10:48am

Thanks Korm. What is gained is a greatly reduced file size.

Allsop · September 6, 2015, 10:57am

I have just done a test using this method and the file size has been greatly reduced from 12.7 MB to 1.4 MB! Thanks for the tip Korm.

korm · September 6, 2015, 11:33am

It would be useful if DEVONthink included a “reduce size” Automator workflow as part of the Support Assistant “extras”, since overly large PDFs are a common complaint in the forum.

Allsop · September 6, 2015, 2:21pm

+1 Sounds good to me.

chdevonthink · September 6, 2015, 10:49pm

Converting a non-searchable PDF to RTF isn’t necessarily going to be smooth. The line breaks will be all in the wrong places unless the PDF document was originally made to reflow. But then again, non-searchable PDFs are not made to reflow, and only some searchable PDFs are made to reflow.

FROBGOBLIN · September 6, 2015, 11:54pm

If you are just looking for the text content, you can textify a pdf as well. This is probably the ultimate way to shrink a pdf down to size.
christopher-mayo.com/?p=551

I keep a copy of the original in one location and the textified version in another. If I find something in the text version and I need to look it up in the original for use in my research (usually the case), it’s pretty easily accomplished. The DT overall database size is kept smaller this way, though the dense text content may affect performance if you add thousands of such files.
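
For what it's worth, the lookup from a textified file back to its original can be sketched in a few lines if the two folders are kept outside the database. This assumes, purely for illustration, that the text version and the original PDF share the same base name; `find_original` and both folder arguments are hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_original(text_file: Path, originals_dir: Path) -> Optional[Path]:
    """Return the original PDF whose base name matches `text_file`, if any."""
    # Assumption: "Some Book.txt" corresponds to "Some Book.pdf".
    candidate = originals_dir / (text_file.stem + ".pdf")
    return candidate if candidate.is_file() else None
```

It returns `None` when no matching original exists, so a renamed file shows up as a miss rather than an error.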