Whole Wikipedia in DTP

Here is my dream: I want to put the entire Wikipedia into DTP. According to
static.wikipedia.org/
the whole archive is 27 GB of HTML files. I know that this is beyond the abilities of any Mac on the market.
I just wanted to hear others’ opinions: how much RAM, CPU power, etc. would it need to run smoothly, and when can we expect that to become feasible?
Does anybody have a more fun idea?

Best
pj

Using the latest beta of DT Pro 1.1, I’ve just indexed the whole ADC Reference Library (for developers), which contains around 43,000 HTML files in more than 5,000 folders, around 660 MB of HTML data in total.

The resulting database is around 200 MB, because the V1.1 index will support phrase searching and therefore requires a little more space/memory. The overall performance is fine so far: opening the database takes 3-4 seconds, and searching for “NSString” takes 0.005-0.2 seconds (depending on the search options and whether it’s the first search).
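
As a rough back-of-the-envelope sketch (not a measurement), the figures above give a database-to-HTML ratio that can be extrapolated to the 27 GB archive mentioned in the first post. The linear scaling and the 1 GB = 1024 MB conversion are assumptions; real index growth may well differ.

```python
# Figures quoted in this thread.
ADC_HTML_MB = 660        # HTML indexed from the ADC Reference Library
ADC_DATABASE_MB = 200    # size of the resulting database
WIKIPEDIA_HTML_GB = 27   # size of the static Wikipedia archive

def estimate_database_mb(html_mb, ratio=ADC_DATABASE_MB / ADC_HTML_MB):
    """Assume the database grows linearly with the amount of HTML (an assumption)."""
    return html_mb * ratio

ratio = ADC_DATABASE_MB / ADC_HTML_MB
wikipedia_db_gb = estimate_database_mb(WIKIPEDIA_HTML_GB * 1024) / 1024

print(f"database/HTML ratio: {ratio:.2f}")
print(f"estimated database for 27 GB of HTML: {wikipedia_db_gb:.1f} GB")
```

Under these assumptions the database comes to roughly 30% of the HTML size, or a little over 8 GB for the full archive.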

Anyway, indexing 3.5-5 GB of HTML should be possible if the computer has 2 GB of RAM. Handling more data would require simplified databases for the job. For example, a case-insensitive index that ignores stop words and is not compatible with phrase searching should at least double the amount of manageable data.
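
To put the 2 GB figure in perspective, here is a naive linear extrapolation to the 27 GB archive. It assumes that RAM needs scale proportionally with the amount of HTML and that a simplified index exactly doubles capacity (the estimate above says "at least"); both are assumptions, not measurements.

```python
RAM_GB = 2.0
HTML_RANGE_GB = (3.5, 5.0)   # HTML manageable with 2 GB of RAM, per the estimate above
TARGET_HTML_GB = 27.0        # size of the static Wikipedia archive
SIMPLIFIED_FACTOR = 2.0      # assumed capacity gain of a simplified index

def ram_needed_gb(target_html_gb, manageable_html_gb, ram_gb=RAM_GB, factor=1.0):
    """Scale RAM linearly with the amount of HTML (an assumption)."""
    return target_html_gb / (manageable_html_gb * factor) * ram_gb

for label, factor in (("full index", 1.0), ("simplified index", SIMPLIFIED_FACTOR)):
    pessimistic = ram_needed_gb(TARGET_HTML_GB, HTML_RANGE_GB[0], factor=factor)
    optimistic = ram_needed_gb(TARGET_HTML_GB, HTML_RANGE_GB[1], factor=factor)
    print(f"{label}: {optimistic:.1f}-{pessimistic:.1f} GB of RAM")
```

By this estimate the full archive would need somewhere around 11-15 GB of RAM with a full index, or roughly half that with a simplified one.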

Not storing metadata (such as creation/modification dates, file aliases, etc.) inside the database would raise the limit further. But that’s all theoretical at the moment :wink:
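
To illustrate what a "simplified" index of this kind might look like, here is a minimal sketch of a case-insensitive inverted index that skips stop words and stores no word positions, so it cannot answer phrase queries. This is not how DEVONthink is implemented; the stop-word list and the names `tokenize`, `build_index` and `search` are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and yield alphanumeric words that are not stop words."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word not in STOP_WORDS:
            yield word

def build_index(documents):
    """Map each word to the set of document names containing it (no positions)."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the documents that contain every non-stop word of the query."""
    words = list(tokenize(query))
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))

docs = {
    "nsstring.html": "NSString is the Cocoa string class",
    "nsarray.html": "NSArray is an ordered collection of objects",
}
index = build_index(docs)
print(search(index, "nsstring CLASS"))
```

Folding case and dropping stop words shrinks the vocabulary, and omitting positions shrinks each entry; the price is that a query can only ask which documents contain all of its words, not whether they appear next to each other.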

In case anybody finds it useful: the Simple English Wikipedia is much smaller (142 MB) and therefore manageable. Also, if you are able to use a database in a different language, the other Wikipedias are small as well.
Does anybody have a usage scenario for these?

pj

I ended up indexing the recent CD version, which contains ~2,000 articles chosen for secondary schools but pretty much covers all basic subjects. The uncompressed HTML pages total 44 MB. It hasn’t put much of a burden on my database so far (though it added about 3M words), and it has improved the See Also results with relevant reference material.

You can browse it from:
fixedreference.org/2006-Wikipedia-CD-Selection/

And download from:
soschildrensvillages.org.uk/ … ion-cd.htm