How I downloaded all Lotusphere 2009 presentations in 30 minutes
Tags: Lotusphere
Read on to see what tool I used when I downloaded all 179 presentations from Lotusphere 2009 online.
Last-minute update:
Just before I published this post, I reran the technique and discovered
that IBM has now altered the design of the pdf.nsf database. The views in the
database are now hidden, so ordinary web spidering no longer works.
I decided to publish anyway so you know about the technique, so read on
if you are curious! Also, if you still want to download the presentations,
you now need to first collect the URLs from the agenda database and then
grab the PDF files by their exact URLs. The download itself can be
programmed in many ways, for example with Java agents or other external
tools. I recall that script-based tools for this kind of job have existed before.
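The two-step approach from the update (collect the URLs first, then fetch each one) could be sketched in Python roughly as follows. This is only a sketch under assumptions: it assumes an agenda page has already been saved as HTML, that PDF links appear as ordinary <a href> tags, and that the server accepts HTTP basic authentication. None of this has been checked against the Lotusphere site, and the names PdfLinkCollector, collect_pdf_urls and download_all are made up for the example.

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import unquote, urljoin, urlsplit


class PdfLinkCollector(HTMLParser):
    """Collects the href of every <a> tag whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and urlsplit(href).path.lower().endswith(".pdf"):
            # Resolve relative links against the page they came from.
            self.urls.append(urljoin(self.base_url, href))


def collect_pdf_urls(html, base_url):
    """Step 1: return the absolute URLs of all PDF links found in html."""
    parser = PdfLinkCollector(base_url)
    parser.feed(html)
    return parser.urls


def download_all(urls, user, password, target_dir):
    """Step 2: fetch each exact URL (assumes HTTP basic authentication)."""
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    for url in urls:
        mgr.add_password(None, url, user, password)
    opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        # Use the last path segment (e.g. AD204.pdf) as the local file name.
        name = os.path.basename(unquote(urlsplit(url).path))
        with opener.open(url) as resp, open(os.path.join(target_dir, name), "wb") as out:
            out.write(resp.read())
```

If the site uses a login form with session cookies instead of basic authentication, the download step would need a cookie-aware opener instead; the link-collecting step stays the same.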
First of all, you need your Lotusphere 2009 online user ID and password to access the online agenda. If you examine the URLs in the agenda solution, you see that a URL looks like this:
https://www.ls09.info/confapps/pdf.nsf/0/F0EC6E9DB3D731B58525752700781D38/$FILE/AD204.pdf
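To make the structure of that URL easier to see, here is a small Python sketch that splits it into its parts. It assumes the usual Domino layout of database, view, document ID, the $FILE marker and the attachment name; the function name split_domino_url is made up for the example.

```python
from urllib.parse import urlsplit


def split_domino_url(url):
    """Split a Domino attachment URL into host, database, view, document and file name."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # The database is the path up to and including the segment ending in .nsf.
    db_index = next(i for i, s in enumerate(segments) if s.lower().endswith(".nsf"))
    # Assumed layout after the database: view / document ID / $FILE / file name.
    view, doc_id, _marker, filename = segments[db_index + 1:db_index + 5]
    return {
        "host": parts.netloc,
        "database": "/".join(segments[:db_index + 1]),
        "view": view,
        "document": doc_id,
        "filename": filename,
    }


info = split_domino_url(
    "https://www.ls09.info/confapps/pdf.nsf/0/"
    "F0EC6E9DB3D731B58525752700781D38/$FILE/AD204.pdf"
)
print(info["database"], info["filename"])  # prints "confapps/pdf.nsf AD204.pdf"
```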
In other words, the PDFs live in the database pdf.nsf. Below you see the URL that appears when I hover over the AD204.pdf link on a page.
(Click on the image to see a larger one.)
It turns out that the PDF database is also available over plain http and not only over https. This is good news for the tool I use, Teleport Pro from Tenmax. Teleport Pro is a general-purpose web spider that costs only $49, which is not expensive for what it does. Below I walk you through how I created a Teleport Pro project that grabbed all 179 presentations in approx. 30 minutes.
(Again, click the image to see a larger one.)
In the screenshot above, you see Teleport Pro when it's first opened. Click File -> New Project Wizard to continue. This brings up the following dialog box:
I select Search a website for files of a certain type. Click the Next button.
I enter the URL http://www.ls09.info/confapps/pdf.nsf as the starting point of my spider session. Note that I use http, not https. You need the much more expensive Teleport Ultra if you really need to spider https! I also tell Teleport Pro to let the spider go 10 levels deep. The default of 3 levels is probably enough, but just to be sure ... Click the Next button.
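To show what a depth limit means for a spider, here is a small Python sketch of a breadth-first crawl over an in-memory link map. It is not Teleport Pro's actual algorithm, and how Teleport Pro counts its levels is not something this post establishes; the page names in site are invented, with the start page counted as depth 0.

```python
from collections import deque


def crawl(start, links, max_depth):
    """Return every page reachable within max_depth link hops, mapped to its depth."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = seen[page]
        if depth == max_depth:
            continue  # Do not follow links from pages at the depth limit.
        for target in links.get(page, []):
            if target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen


# Hypothetical site layout: database -> view -> document -> attachment.
site = {
    "pdf.nsf": ["view1", "view2"],
    "view1": ["doc-a"],
    "doc-a": ["AD204.pdf"],
}
print("AD204.pdf" in crawl("pdf.nsf", site, 2))  # prints False
print("AD204.pdf" in crawl("pdf.nsf", site, 3))  # prints True
```

In this made-up layout the PDF sits three hops from the start page, so a limit of 2 misses it while 3 reaches it. That is the reason to be generous with the depth setting.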
Now it's time to specify which file types I want to retrieve. Click the Add button in the screen above to see the following drop-down:
Select User defined... to define your own:
Note that I specify PDF files by using the pattern *.pdf. Click the OK button.
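The *.pdf pattern is an ordinary wildcard filter on file names. A minimal Python sketch of the same idea, using the standard fnmatch module, looks like this. The helper name matches_pattern is made up, and matching case-insensitively is an assumption; I have not confirmed how Teleport Pro treats upper and lower case.

```python
from fnmatch import fnmatchcase


def matches_pattern(filename, pattern="*.pdf"):
    """Return True if filename matches the wildcard pattern, ignoring case."""
    # Lower-case both sides so AD204.PDF matches too (case handling is an assumption).
    return fnmatchcase(filename.lower(), pattern.lower())


names = ["AD204.pdf", "index.html", "BP101.PDF", "pdf.nsf"]
print([n for n in names if matches_pattern(n)])  # prints ['AD204.pdf', 'BP101.PDF']
```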
Note how the file type has been added. Now it is important to enter your Lotusphere 2009 online username and password in the Account and Password fields. Otherwise the spider won't be able to grab the files. Click the Next button.
And now you are finished! Click the Finish button.
You now choose a name for your Teleport Pro project. Note that Teleport Pro will also create a directory with the same name. The grabbed PDF files will be saved to that directory! Click the Save button.
Now you see your project in Teleport Pro. You are ready to start grabbing files by pressing the Start button in the toolbar.
Teleport Pro will now spider the site from the starting URL and retrieve all the PDF files in a snap!
As a closing note, you should now understand that this spidering technique can be used to grab files and other information from many other sites as well. Finally, remember to use this technique in accordance with copyright and similar restrictions.