@joshgibson, there is a long (and at times somewhat tortuous) thread here that examines many of the ins and outs of the various formats in which DT can capture and store web content. One lesson I learnt, as one of the culprits in that discussion, is to be really sure why you want to get URLs, webarchives, PDFs, etc.
Happy to share a précis of my findings if it’d help you. Pete’s scripts and his understanding of the issues are invaluable. Good luck!