Best practices to create epub from pdf

amelchi · December 23, 2021, 9:46am

Any suggestions?
Starting from DT (PDF already OCR'd…)?
Thanks

rmschne · December 23, 2021, 9:55am

I searched for “epub” in the “DEVONthink Handbook” for you and found nothing about it being able to create EPUB from anything. It can index/search/view. I suggest you start by searching the net for this capability. I see a lot of websites, and there are surely other tools you can purchase to do what you want, but this is out of scope for DEVONthink.

chrillek · December 23, 2021, 10:48am

In general, producing ePUB from PDF will not be feasible. ePUB is basically HTML, i.e. a structured format. PDF is anything but structured.
There might be exceptions, but I doubt that it’s worthwhile investigating further.

Blanc · December 23, 2021, 11:05am

You might want to see the calibre documentation on converting specific file types; scroll approx. half way down for PDF. The website states

To re-iterate PDF is a really, really bad format to use as input. If you absolutely must use PDF, then be prepared for an output ranging anywhere from decent to unusable, depending on the input PDF.

Disclaimer: I haven’t used calibre myself - it just came up when I had a quick duckduckgo on how to convert pdf to epub.

chrillek · December 23, 2021, 1:41pm

It’s not really a calibre-specific issue. PDF is like a graphics format. Basically, you can say

goto 10, 20
set the font to Helvetica Bold 20 point
write "Hi "
store the current position in x
set the font to Helvetica 12 point
write "I'm going crazy with all this"
restore position from x
set the font to Helvetica Bold 20 point 
write "there,"

In HTML, you’d want to have something like (and it would be a lot of fun already to assemble the h1 element from the PDF above)

<h1>Hi there,</h1>
<p>I'm going crazy with all this</p>

and you’d have the corresponding CSS (approximately)

h1 { font-family: "Helvetica"; font-size: 20pt; font-weight: bold; }
p { font-family: "Helvetica"; font-size: 12pt; font-weight: normal; }

Now imagine the PDF contains a 20 point piece of text in Times Roman – is that another h1 element? Or just something that the original author wanted to appear in 20 point Times Roman?

Basically, PDF is like drawing whatever comes to your mind whenever it comes to your mind (as long as it stays on the same page). HTML is about structure: it has no idea of pages, and presentation is preferably managed by CSS, not in the HTML itself. Also, HTML can reflow (i.e. you can make the window smaller and the text follows). No such luck with PDF.
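To make the ambiguity concrete, here is a minimal Python sketch of the kind of guessing a converter has to do. It is not how calibre or any other tool actually works. The `Run` record, the coordinates, and the "20 point means heading" rule are all assumptions made up for illustration. It takes text runs in drawing order, sorts them into reading order, and guesses the tags from the font size alone.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Run:
    """One piece of positioned text (hypothetical record, not a real PDF API)."""
    x: float      # horizontal position on the page
    y: float      # vertical position, assumed to grow downward
    font: str
    size: float   # in points
    text: str


def runs_to_html(runs, heading_size=20.0):
    # Drawing order is not reading order: sort top to bottom, then left to right.
    ordered = sorted(runs, key=lambda r: (r.y, r.x))

    # Merge neighbouring runs that share a line and a font into one block.
    blocks = []
    for run in ordered:
        last = blocks[-1] if blocks else None
        if last and (last.y, last.font, last.size) == (run.y, run.font, run.size):
            last.text += run.text
        else:
            blocks.append(Run(run.x, run.y, run.font, run.size, run.text))

    # Guess the structure from the font size alone. This is the weak point.
    lines = []
    for block in blocks:
        tag = "h1" if block.size >= heading_size else "p"
        lines.append(f"<{tag}>{escape(block.text.strip(), quote=False)}</{tag}>")
    return "\n".join(lines)


# Same drawing order as the PDF example: "Hi ", the body text, then "there,".
page = [
    Run(10, 20, "Helvetica-Bold", 20, "Hi "),
    Run(10, 50, "Helvetica", 12, "I'm going crazy with all this"),
    Run(40, 20, "Helvetica-Bold", 20, "there,"),
    Run(10, 80, "Times-Roman", 20, "A pull quote, not a heading"),
]
print(runs_to_html(page))
```

The heading and the paragraph come out as hoped, but the 20 point Times Roman line is also wrapped in an h1, because the size is the only thing the heuristic can go on.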

Blanc · December 23, 2021, 1:45pm

Sure; calibre just happened to be available as an example. I preferred referencing it over the online services which also claim to offer such conversions (and which appear first when using pertinent search terms); I would personally not choose to trust any such services with one of my PDFs.

pete31 · December 23, 2021, 2:45pm

You could try to

1. Convert PDF to RTF in DEVONthink
2. Remove unnecessary stuff from RTF
3. Use an online RTF to ePub converter

BLUEFROG · December 23, 2021, 3:52pm

Why do you want an EPUB if you already have a PDF?

amelchi · December 23, 2021, 4:05pm

It is easier to read on iOS devices…

pete31 · December 23, 2021, 6:14pm

If that’s your goal then you could split your PDFs into RTFs

1. Convert PDF to RTF in DEVONthink
2. Remove unnecessary stuff from RTF
3. Use Script: Split RTF(D) at Font Sizes

The conversion from PDF to RTF is not perfect, but it’s a nice way to make stuff easier to read. I don’t delete the PDF; the RTF is really only for reading on iPhone.

In cases where I want to keep the split RTFs (instead of deleting them after reading) I open the unsplit RTF in Nisus Writer and remove page numbers etc. via regex. It’s a bit more work but this way one can get quite good results.
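For anyone without Nisus Writer, the same kind of cleanup can be sketched in Python on a plain-text export. This is only an illustration under assumptions: the pattern and the `strip_page_numbers` name are made up here, it is not the regex I use in Nisus, and it works on plain text rather than on RTF. It drops lines that hold nothing but a page number.

```python
import re

# A line holding nothing but a page number, e.g. "12", "- 12 -" or "Page 12".
# Assumption: page numbers sit alone on their own line and have at most 4 digits.
PAGE_NUMBER_LINE = re.compile(
    r"^[ \t]*(?:page[ \t]+)?-?[ \t]*\d{1,4}[ \t]*-?[ \t]*$\n?",
    re.IGNORECASE | re.MULTILINE,
)


def strip_page_numbers(text):
    """Remove lines that consist solely of a page number."""
    return PAGE_NUMBER_LINE.sub("", text)


sample = (
    "First paragraph.\n"
    "12\n"
    "Second paragraph.\n"
    "- 13 -\n"
    "Page 14\n"
    "In 2021 we met.\n"
)
print(strip_page_numbers(sample))
```

Numbers inside a sentence are left alone, but a real line that contains only a number would be removed too, so it’s worth checking the result before deleting anything.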