Best way to consolidate decades of saved research web pages

gred22 · March 24, 2022, 3:00pm

I do a lot of research, and as part of that, I save a lot of web pages. I found many years ago that the best way to do that accurately was to use a Firefox plugin called ScrapBook X. Web page captures in other programs, like Evernote or Scrivener, were more often than not defective in one way or another, especially eBay pages for some reason.

So, in over two decades, I have accumulated close to three dozen data sets of web pages. I decided to do a simple drag/import test of one of these data sets, with 16,403 web pages. The main data folder has a subfolder for each saved web page, and each subfolder is named with the date and time of the save.
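Since each subfolder name carries the save time, that date could be recovered before or after import. Here is a minimal Python sketch, assuming the names are 14-digit timestamps in the form YYYYMMDDHHMMSS (an assumption; the exact format is not stated above). The function names `parse_saved_at` and `list_saved_pages` are made up for illustration.

```python
from datetime import datetime
from pathlib import Path
from typing import List, Optional, Tuple


def parse_saved_at(folder_name: str) -> Optional[datetime]:
    """Return the save time encoded in a folder name, or None if it does not fit.

    Assumes a 14-digit YYYYMMDDHHMMSS name; this format is an assumption.
    """
    if len(folder_name) != 14 or not folder_name.isdigit():
        return None
    try:
        return datetime(
            int(folder_name[0:4]),    # year
            int(folder_name[4:6]),    # month
            int(folder_name[6:8]),    # day
            int(folder_name[8:10]),   # hour
            int(folder_name[10:12]),  # minute
            int(folder_name[12:14]),  # second
        )
    except ValueError:
        # Digits were present but do not form a valid date or time.
        return None


def list_saved_pages(data_dir: str) -> List[Tuple[datetime, Path]]:
    """List (save time, folder) pairs for matching subfolders, oldest first."""
    pages = []
    for entry in Path(data_dir).iterdir():
        saved_at = parse_saved_at(entry.name)
        if entry.is_dir() and saved_at is not None:
            pages.append((saved_at, entry))
    return sorted(pages, key=lambda pair: pair[0])
```

Folders whose names do not match are simply skipped, so anything else in the data folder is left alone.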

There is also an .rdf file that contains organization data for subject subfolders within ScrapBook X, but I do not expect there to be any way for DEVONthink to be able to decode and use that file.

Clicking on the main index.html within one of the imported folders does indeed pull up the web page accurately in DEVONthink.

However, before continuing any further with the other gobs of data, I thought one of you DEVONthink experts might have some tips or techniques for ending up with one better, consolidated set of decades of web pages, AND being able to easily locate needed data.

PS - it took about 7 hours to import that one data set - not a problem. But, it is still indexing and will take much longer. No problema. But, maybe there is a better way to get all my data together?

cgrunenberg · March 24, 2022, 3:08pm

How many files and MB/GB did you actually import?

gred22 · March 24, 2022, 4:25pm

A total of 32.8 GB of data (37.1 GB on disk) with 1,668,166 files. As I said, there were 16,403 folders, each representing a saved web page. But each of those folders has many (usually very small) files - PNGs, etc.
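For anyone who wants to get the same kind of numbers for their own data set before importing, here is a rough Python sketch. It counts top-level folders, files, and the total of the logical file sizes, which corresponds to the plain data figure rather than the "on disk" figure, since allocated blocks are not measured. The function name `summarize_tree` is made up for illustration.

```python
import os
from pathlib import Path
from typing import Tuple


def summarize_tree(root: str) -> Tuple[int, int, int]:
    """Return (top-level folder count, total file count, total bytes) for root."""
    folder_count = sum(1 for entry in Path(root).iterdir() if entry.is_dir())
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_count += 1
            try:
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # Unreadable entries (for example broken links) still count
                # as files but contribute no size.
                pass
    return folder_count, file_count, total_bytes
```

Usage would be something like `folders, files, size = summarize_tree("/path/to/data")`, where the path is a placeholder for the real data folder.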

cgrunenberg · March 24, 2022, 5:42pm

That’s a lot of data. Usually up to 300,000 items are recommended per database, depending on the speed of the computer and its amount of RAM. In addition, please ensure that there’s enough disk space left on the startup volume.
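To put that guideline next to the numbers above, here is a small Python sketch. It assumes every imported file counts as one item and treats 300,000 as a fixed cap, although it is only a recommendation that depends on the machine. The function names `min_databases` and `batch_folders` are made up for illustration.

```python
import math
from typing import Iterable, List, Tuple


def min_databases(total_items: int, limit: int = 300_000) -> int:
    """Smallest number of databases that keeps each at or under the limit."""
    return math.ceil(total_items / limit)


def batch_folders(
    folder_items: Iterable[Tuple[str, int]], limit: int = 300_000
) -> List[List[str]]:
    """Group (folder name, item count) pairs, in order, into batches under the limit.

    A simple greedy pass: a new batch starts whenever the next folder would
    push the current batch past the limit.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_total = 0
    for name, count in folder_items:
        if current and current_total + count > limit:
            batches.append(current)
            current, current_total = [], 0
        current.append(name)
        current_total += count
    if current:
        batches.append(current)
    return batches


# 1,668,166 files at up to 300,000 items per database
print(min_databases(1_668_166))  # 6
```

Under those assumptions, the single data set mentioned above would already need about six databases, which is why splitting by page folder is sketched as well.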

ibuys · March 24, 2022, 7:16pm

Every now and then I think “boy, I sure put DEVONthink through its paces.” Then someone like @gred22 shows up and I’m reminded that my use of DT is very, very lightweight.

I love hearing stories about folks pushing the boundaries of what DT can do.

gred22 · March 29, 2022, 1:00am

Thanks for the responses - I got tied up trying to get 369 GPS tracks from a decade of archeology studies in Spain out of my iPad after the app was canceled.

I have plenty of disk space, but believe it or not - DEVONthink still has a little to go on indexing.

I guess what I was looking for was whether anyone else had imported saved web pages from another app - and might have a better way to do what I am doing.

mdbraber · March 29, 2022, 4:51pm

If you’re talking about GPS tracks, it might be highly interactive content. But if it’s more static content, I think I would have gone for exporting everything as PDFs using the PDF capture as has been described in other places on this forum. AppleScript would make it rather easy to automate steps like “open index.html in every folder and capture PDF”.
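The "every folder" part of that loop can be sketched outside DEVONthink. The Python below only collects the index.html of each page folder and turns it into a file:// URL; the PDF capture itself is not shown and would still be done by DEVONthink or a script driving it. It assumes each page folder has an index.html directly inside it, and the function name `find_index_pages` is made up for illustration.

```python
from pathlib import Path
from typing import List


def find_index_pages(data_dir: str) -> List[Path]:
    """Return the absolute path of index.html in each subfolder that has one."""
    pages = []
    for folder in sorted(Path(data_dir).iterdir()):
        index = folder / "index.html"
        if folder.is_dir() and index.is_file():
            pages.append(index.resolve())
    return pages


def as_file_urls(pages: List[Path]) -> List[str]:
    """Convert absolute paths to file:// URLs that a capture step could open."""
    return [page.as_uri() for page in pages]
```

Folders without an index.html are skipped rather than reported, so it may be worth comparing the result count against the number of folders.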