Accessing zipped content

Greetings,

I’m just trying out DTPro as a way to catalog and archive a lot of widely varying material and am trying to understand how I would make it do everything I need. One issue I have come across is that I have a significant number of zipped files (mostly .gz) in archived websites that I may want to access from time to time. When I open these from within DTPro, they are unzipped into the folder in the database package, but do not show up in DTPro. I thought that synchronizing might help me here, but that does not make them show up either. What I was hoping is that the unzipped version (.ps in this case) would go into a temp directory and be opened automatically. It would also be nice if zipped content were searchable with DTPro. Any hints on how to streamline what I am trying to do here?

Thanks,
Tim Nelson
Stanford Linear Accelerator Center

DT will not unarchive compressed files (such as .gz or .zip) in order to index their contents. Therefore, if you want the contents of these zipped files to be visible to the AI, you’ll need to index/import the unzipped contents.

Thanks,

Visible to the AI is one thing (a wish), but visible and accessible from the DT interface is quite another (an expectation)…

I understand that DT will not unarchive/index files on its own. That is more of a long-term wish. I really need an answer to my other problem, though: the gymnastics that I must go through to simply VIEW the contents of any of these files stored in the database. Remember that these files are part of an archived website, so there are links that point to these files in the pages of the site.

Forgetting zipfiles for the moment… as it stands, if I have a link to a file, call it foo.pdf, I can navigate through the site on disk as expected, but when I click on the link to foo.pdf, DT tries to download the copy from the internet instead of reading the copy that is already there inside the database. This defeats the purpose of archiving the website in the first place. Yes, I can still navigate to that directory in DT and open the file, but that should not be necessary. I am really confused as to why DT cannot correctly resolve the link to the file on disk. Basically, that means a lot of digging around any time I want to access content via the interface of the archived site, and that seems like basic functionality to me.

Worse yet, if I have a link to bar.ps.gz and I click on it, DT also looks for the copy on the internet even though the file is in the database in the correct place. If I manually go to that directory to view the file, BOMArchiveHelper expands the file into the same directory of the database, but DT has NO IDEA of this, so I must go through the following (a script sketch of a workaround follows the list):

  • “Show in Finder” to see the directory

  • Manually open and/or drag that file into DT in order to access it

  • If I don’t add the uncompressed file to the database, I’ll have to dig around again the next time I want to see it.
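Something like the following would at least approximate the temp-directory behavior I was hoping for. This is only a sketch, assuming gunzip and the stock macOS open command are available, and the archive path is made up:

    #!/bin/sh
    # Hypothetical database package path -- substitute your own.
    ARCHIVE="$HOME/Documents/Archive.dtBase"
    # Decompress every .ps.gz into a temp directory, leaving the
    # originals in place, then open each result in the default viewer.
    TMP=$(mktemp -d /tmp/dtps.XXXXXX)
    find "$ARCHIVE" -name '*.ps.gz' | while read -r f; do
        out="$TMP/$(basename "$f" .gz)"
        gunzip -c "$f" > "$out"
        open "$out"
    done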

It is hard for me to believe I am not doing something wrong here. At the very least, I expect DT to be able to resolve an http link to a file that is in the archived website! If not, I cannot imagine that much work would be needed to improve the situation. This is pretty basic stuff compared to much of what DTPro is capable of!

Best,
Tim

Hi Tim,

I understand your problem. First of all, if files aren’t visible to the AI, you won’t be able to find them in DT.
Secondly, if you archive a website and the links are absolute links to “http://xxx”, then DT cannot go anywhere else but the internet, since that is what the link refers to. If they are relative links but the main page link refers to “http://xxx”, it won’t work either (identical problem).
Thirdly, assuming none of that is the issue, i.e. everything is relative to the start page, you’d need to index and not import the site. I believe this now happens automatically, but if you did this in a version prior to 1.1 the site may have been imported. Then it may work, but Bill and Christian will have more experience with this.
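To make the distinction concrete (the URLs and paths here are placeholders only):

    <!-- absolute link: DT can only follow this to the internet -->
    <a href="http://www.example.com/docs/foo.pdf">foo.pdf</a>

    <!-- relative link: this can be resolved against the local copy -->
    <a href="docs/foo.pdf">foo.pdf</a>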

Since you have the web sites on your disk, you could also try to use the site-sucker functionality from the Download Manager. When you click the “+” to add a URL, use the “file://” URL type that points to your website. Try to see if that will help you. I think that may give better results. Of course, you can’t move the files to a different folder unless you change the file URL(s) of the affected records.
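For reference, a file URL has this general shape (the path is purely illustrative):

    file:///Users/yourname/Sites/archive/index.html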

Thanks,

I’m sure I’ll figure out an OK way to deal with it. I am, of course, using DT Pro 1.2.1, and I created the archive by starting from the main page of the site, recursively importing all subdirectories, and checking all file types in the options.

In this way, it appears I have a complete copy of the website (which is pretty much all static), although I am a little puzzled about a few things.

First, it clearly got into an infinite loop at some point and kept downloading some of the files repeatedly, putting lots of replicants in the archive (one file, lots of database entries). It is not at all clear to me why it decided to do this with some files and not others, which appear to be the same w.r.t. file type and the organization of the site. At some point, I recognized that it had gone mad, stopped the download, and cleaned up the replicated copies.

The second surprise is that although it appears to have the .html files somewhere, I do not see them in the database package file. I haven’t made a concerted effort to figure out where those files are, but I am puzzled as to why I cannot simply “Show in Finder” like the other files of the site. However, I can disconnect the network and still browse the site, so these files must be somewhere (unless they are cached).

The final surprise was my inability to access some of the content by link, which is odd since all the links are relative. I’m still trying to find my way around DT (which I’ve only played with for a few hours), so I may well be misunderstanding what is going on.

Thanks Again,
Tim

More observations…

From some more-ing and grep-ing, I now understand that .html and .rtf are stored within the binary-format database file. That explains the absence of these files as regular files within the package. There are still some mysteries…
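For the record, the poking around was along these lines; the package path is made up, so substitute wherever your database actually lives:

    # hypothetical location of my database package
    cd ~/Documents/Archive.dtBase
    # list every file, including the binary database itself, that
    # contains a phrase copied from a page I know is in the archive
    grep -rl 'a phrase from a known page' . | more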

When I click on a link to a .ps file in the archive, I get a gray background, but the .ps is never rendered. However, if I navigate to the referred-to .ps in DT, it is converted and renders just fine when I select it.

For all files but my .ps.gz, selecting the file renders it without any internet access. However, the .ps.gz files show up as links in DT (blue, underlined), and selecting one attempts to access the internet even though the file is local. Double-clicking opens the .ps.gz with BOMArchiveHelper and generates a .ps in the database package (which DT does not see…).

I think there are some bugs in here, but I’m not familiar enough with DT yet to tell bug from feature.

-Tim

There was another related thread recently; try not to import but to index your content. That should do the trick regarding links.

Thanks Annard,

Being so new to DT, I’m not sure I know the difference. I used the Download Manager in 1.2.1 to read the entire site, including all file types, recursing into all subdirectories. What do I need to do instead/in addition?

Best,
Tim

I did some testing and I discovered the following:

  1. The Download Manager doesn’t support “file://” URLs.

  2. I had a documentation directory for a bunch of Java code that I had written and used the “File->Index” command to put it in the database. Result: all the links worked.

So please give option 2 a whirl. :-)

Hi Annard,

I am happy to try this, but I am a bit confused. I used the Download Manager to pull the site (which lives on a remote server) into DT, so it is already in the database. When I try to import, it wants me to select a file or files to import, but of course there is no way to “crack open” the site that has already been incorporated into the database. Are you saying I should use wget (or another tool) to get the site onto my disk and then index it, or am I missing something?

I no longer have an account on the server that hosts the site (former employer), which is why I want to catalog it all in DT.

Thanks,
Tim

Sorry, I was under the impression that you had the site locally on your machine. Oops.

In that case, if you really want a complete copy on your local machine, I would indeed use wget with URL rewriting. Bill tells me Acrobat is also a good tool for this, but I don’t have it, and I prefer command-line tools anyway.
Check the forum on this topic; you may find other posts recommending other tools, but since you mentioned wget, I guess you’re familiar with it.
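For example, something along these lines will mirror the site and rewrite the links so the local copy is browsable offline (www.example.com stands in for the real site):

    # mirror recursively, grab page requisites (images, CSS),
    # stay below the start directory, and rewrite links locally
    wget --mirror --page-requisites --no-parent --convert-links \
        http://www.example.com/

Then point File->Index at the downloaded directory.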

tknelson, there are option settings in DT Pro’s Download Manager that let you specify whether or not you wish to download file types such as .doc, .pdf, images, etc. to your disk.

Unless those linked file types were checked, they are not physically downloaded to your drive, but remain externally linked via the URLs on the original Web site.

But if you did check them for download, they will be on your disk and available for offline viewing on your computer.

Hi Bill,

As I said in earlier posts, I checked every single box, including “unknown”. It still seems to look on the web for some things when they are selected, even though double-clicking causes the file to open from disk.

-Tim