Using the Download Manager

astro · December 2, 2020, 4:40pm

Hi all,

I would like to have the complete archive of motions of the German parliament in DT3.

The Deutsche Bundestag puts them online in a quite systematic fashion:

The index of the motions is numbered as follows: 19 / n

19 is the legislative period, and n is the number of the motion.

They put them accordingly in a folder structure on their servers. For example the very first document of the new Parliament in 2017 was 19/1.

It can be found in the folder: …/19/000/1900001

By the end of November it was 19/24779.

It can be found online at …/19/247/1924779
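The numbering scheme described above can be sketched in a few lines of Python. This is only an illustration inferred from the two examples in this post: the function name `motion_path` is made up, and the rule "subfolder = motion number divided by 100, zero-padded to three digits" is an assumption, not something the Bundestag documents here.

```python
def motion_path(period: int, number: int) -> str:
    """Return the relative folder/file path for motion period/number.

    Assumption: the subfolder is the motion number divided by 100
    (zero-padded to 3 digits) and the file name is the period followed
    by the motion number zero-padded to 5 digits.
    """
    folder = f"{number // 100:03d}"   # 1 -> "000", 24779 -> "247"
    name = f"{period}{number:05d}"    # 1 -> "1900001", 24779 -> "1924779"
    return f"{period}/{folder}/{name}"


print(motion_path(19, 1))      # 19/000/1900001
print(motion_path(19, 24779))  # 19/247/1924779
```

Both printed paths match the two examples given above, which is the only evidence for the rule.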

So I thought it would be possible to add the URL http://dipbt.bundestag.de/dip21/btd/19/ to the Download Manager and grab the PDFs from the subfolders.

But I get a 403 in the Download Manager window and nothing happens.

Any idea why that is? Is there a solution, since the folder structure is so systematically organised?

Thanks in advance for any comments.

chrillek · December 2, 2020, 5:04pm

I get a 403 error with the link you posted in the browser as well. So presumably the link is wrong or the Bundestag is not working properly.

chrillek · December 2, 2020, 5:10pm

For the record: one of the links I found looks like this

Is that possibly what you meant? (It is from the “Anträge” section on the right of the website.)

astro · December 2, 2020, 6:00pm

Correct.

The documents can be reached in various ways, i.e. under different addresses:

dserver.bundestag.de, as you did,

or via

dip21.bundestag.de.

see here:

or on dipbt.bundestag.de

If I put dserver.bundestag.de into the Download Manager, it works, but I get the full website of the Deutscher Bundestag. That would exceed all the disk space I have.

But when I tried it, it showed me this

The dip21 subfolder is the one I want to grab, but not the rest.

I hope I described it in an understandable fashion.

BLUEFROG · December 2, 2020, 6:02pm

I’m not sure why you’re getting the 403 offhand, but here is a DEVONagent search set that pulls the PDFs and puts them into DEVONthink’s Global Inbox:

Bundestag.agentSet.zip (727 Bytes)

Here are results I saw…

Note: You may get no results if the site thinks you’re a robot. Click the Links tab, double-click the Google search, and check whether you see a Captcha. Confirm that you’re not a robot and try again.

astro · December 2, 2020, 6:29pm

WOW!!!
How great! thank you very much!

May I ask, why are there only 300 files?

Is this limit set by the engine?

In the case of the Bundestag there should, by definition, be approx. 24576 results/files.

chrillek · December 2, 2020, 6:35pm

I’m sorry, I didn’t realize that Discourse would immediately resolve the link and thereby make it unrecognizable. I was talking about this
https://dserver.bundestag.de/btd/19/161/1916186.pdf
instead of your variant. It looks very much the same as your version, but it doesn’t give a 403.

And of course I didn’t want to suggest that you kill your hard disk by downloading all proceedings. I don’t know Download Manager, but if you tell it to grab
https://dserver.bundestag.de/btd/19
shouldn’t that work, too?

BLUEFROG · December 2, 2020, 6:35pm

That’s all DEVONagent returned.
Where are you seeing this other number?

Also, note some items may be reported as Too Big. You can adjust this in DEVONagent’s Preferences > Search > Max. Download Size.

astro · December 2, 2020, 6:45pm

Sorry, it would probably be 24779, since this seems to be the last document they uploaded.

Index number 19/24779.

The folder structure is like 19/247/xx,

so each subfolder would contain 100 documents.

So grabbing dipbt.bundestag.de/dip21/btd/19/… would/should end up with 24779 PDFs.

The Bundestag assigns the numbers consecutively and publicly. So all documents should be there (and they are, as far as I tested with arbitrary numbers).
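The expected totals can be checked with a small Python sketch. It assumes that the numbers run from 1 to 24779 without gaps and that the subfolder is the motion number divided by 100; both are assumptions taken from this thread, not confirmed facts about the server.

```python
from collections import Counter

LAST_NUMBER = 24779  # highest motion number mentioned in the thread (19/24779)

# Assumption: numbers run from 1 to LAST_NUMBER without gaps, and each
# subfolder is named after the number divided by 100 (zero-padded to 3 digits).
per_folder = Counter(f"{n // 100:03d}" for n in range(1, LAST_NUMBER + 1))

print(len(per_folder))           # number of subfolders: 248
print(sum(per_folder.values()))  # total number of documents: 24779
print(per_folder["000"], per_folder["123"], per_folder["247"])  # 99 100 80
```

Under these assumptions the first subfolder holds 99 documents (there is no 19/0) and the last one holds 80 so far, so "100 per subfolder" is exact only for the folders in between.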

BLUEFROG · December 2, 2020, 7:32pm

Perhaps there is a limiting factor as the Google results return 299 results in the Log.
There is also a hard limit of 1,000 results per plugin returned.

@cgrunenberg would have to comment on this.

astro · December 2, 2020, 7:38pm

“if you tell it to grab https://dserver.bundestag.de/btd/19 shouldn’t that work, too?”

It started great. (In my settings I said only PDFs and 2 levels down the links.)

But then it stopped all of a sudden at 746 files.

The result was like

so your test file from above: https://dserver.bundestag.de/btd/19/161/1916186.pdf

did not make it into the final download.

chrillek · December 2, 2020, 8:00pm

Anything in DT’s log? Maybe our parliament tries to prevent too many downloads in one go…

astro · December 2, 2020, 8:17pm

nothing in the log

in the window of the download manager it ends like this

chrillek · December 3, 2020, 7:49am

Which is again a 403 … Maybe you could try to download in batches of 500 or so?

astro · December 3, 2020, 8:04am

That would mean first the subfolders in the range of 000 to 004 then 005 to 010 and so on.

I would gladly do that, but how do I set that up?
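As a rough illustration of what such batches could look like, here is a Python sketch that groups the subfolders five at a time (about 500 documents per batch). The function name `folder_batches` is hypothetical, and the thread does not say whether the Download Manager can be fed such a list directly; the sketch only shows the grouping.

```python
def folder_batches(last_number, folders_per_batch=5):
    """Yield lists of zero-padded subfolder names, folders_per_batch at a time.

    Assumption: subfolders are numbered 000 up to last_number // 100.
    """
    folders = [f"{i:03d}" for i in range(last_number // 100 + 1)]
    for start in range(0, len(folders), folders_per_batch):
        yield folders[start:start + folders_per_batch]


batches = list(folder_batches(24779))
print(batches[0])    # ['000', '001', '002', '003', '004']
print(batches[-1])   # ['245', '246', '247']
print(len(batches))  # 50
```

Each batch could then be turned into URLs such as …/19/000/, …/19/001/ and so on, and fetched one batch at a time.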

I don’t know much about this stuff, but isn’t it odd that the Download Manager started at 171 with one file and then jumped to 176, grabbing two files?

One would expect a more “machine-like” procedure. We know for a fact that each subfolder contains 100 files. Strange, isn’t it?

chrillek · December 3, 2020, 8:55am

Actually, I have no idea. I’ve never used the Download Manager myself. Maybe @BLUEFROG or @cgrunenberg know more about it.

cgrunenberg · December 3, 2020, 9:50am

By default most plugins return only 100 results; this can be customized by creating a custom search set.

chrillek · December 3, 2020, 10:49am

I set the number of connections to 8 in the options panel and told it to only download PDFs and Office documents. That way, it found more than 1700 and downloaded at least 802 when I stopped it manually (I’m not that eager to fill my disk).
From your last screenshot, I’d guess that there’s something weird going on somewhere on the website. “$DirectLink&Strace+localhost” looks like some scripting gone bad. Also, opac.bundestag.de is very different from the other bundestag.de subdomains.
If I enter this last URL in my browser, I get this

which of course is not at all what one wants. I’m not sure how DT got there, but probably by following one of the links on one of the websites …
You might want to try a different approach, maybe using cURL or wget or something similar. Although I can’t promise you that you will not run into a similar problem there… The only viable possibility (to me) seems to be to mirror the site (yeah, go get yourself an external disk) and then simply collect all the PDFs from your local copy.
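For the "something similar" route, a minimal Python sketch using only the standard library might look like the following. It is a stand-in for the cURL/wget idea, not a tested recipe for this server: the function names are made up, the one-second pause is an arbitrary guess at polite behaviour, and nothing here guarantees that the 403 responses will go away.

```python
import time
import urllib.error
import urllib.request
from pathlib import Path


def fetch_pdf(url, timeout=30):
    """Download one URL and return its bytes (raises HTTPError on e.g. 403/404)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, target_dir, fetch=fetch_pdf, delay=1.0):
    """Save each URL into target_dir; return (url, status) for HTTP failures."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in urls:
        destination = target / url.rsplit("/", 1)[-1]
        if destination.exists():
            continue  # already downloaded in an earlier run
        try:
            destination.write_bytes(fetch(url))
        except urllib.error.HTTPError as error:
            failed.append((url, error.code))  # remember it and move on
        time.sleep(delay)  # assumption: a pause between requests is friendlier
    return failed
```

Calling `download_all(list_of_urls, "bundestag_pdfs")` returns the URLs that failed together with their status codes, so a run that hits a 403 can be inspected and repeated later; files that already exist are skipped.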