A script to batch download files using PycURL

When downloading a lot of large files (e.g., remote sensing data), it is difficult to do using a browser as you don’t want to set all the files downloading at once and sitting round waiting for one download to finish and another to start is tedious. Also you might want to download files to somewhere other than your ‘Downloads’ folder.

To make downloading files easier you can use the CURLDownloadFileList.py available on Bitbucket.

The script uses PycURL, which is available to install through conda using:

conda install pycurl

It takes a text file with a list of files to download as input, if you were downloading files using a browser you can right click on the line and select ‘Copy Link Location’ (the exact text will vary depending on the browser you are using) instead of downloading the file. Some files follow a logical pattern so you could use this to get a couple of links and then just copy and paste changing as required (e.g., the number of the file, location, date etc.,).

Once you have the list of files to download the script is run using:

python CURLDownloadFileList.py \
     --filelist ~/Desktop/JAXAFileNamesCut.txt \
     --failslist ~/Desktop/JAXAFileNamesFails.txt \
     --outputpath ~/Desktop/JAXA_2010_PALSAR/ 

Any downloads which fail will be added to the file specified by the ‘–failslist’ argument.

By default the script will check if a file has already been downloaded and won’t download it again, you can skip this check using ‘–nofilecheck’. It is also possible to set time to pause between downloads with ‘–pause’, to avoid rejection from a server when doing big downloads. For all the available options run:

python CURLDownloadFileList.py  --help

As it is running from the command line, you can set it running on one machine (e.g., a desktop at the office) and check on the progress remotely (e.g., from your laptop at home) using ssh. So the script keeps running when you you close the session you can run within GNU Screen, which is installed by default on OS X. To start it type:

screen

then type ctrl+a ctrl+d to detach the session. You can reattach using:

screen -R

to check the progress of your downloads.
Alternatively you can use tmux. However this isn’t available by default.

One thought on “A script to batch download files using PycURL

  1. Osian

    On a similar topic:
    Recently I had to download thousands of tiles from a HTTPS repository. I initially tried using Wget, but encountered problems with the server accepting my username/password. Then I found a Firefox add-on called DownloadThemAll. This has some excellent filtering capabilities which allows you to download only certain file types or file names.

    Its probably not as elegant as your method, but it worked for me.

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s