Tag Archives: GNU Parallel

Execute commands on multiple computers using GNU Parallel (setting up a cluster on the cheap)

I’ve mentioned before how awesome GNU Parallel[1] is for easily making use of multiple cores on a single machine. You can also use it to run commands on multiple machines if you have SSH access to them, and have set up SSH keys for password-less login (there is a guide to setting up SSH keys here https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys–2

For this example, we’ll assume I have set up SSH keys for three computers under my username ‘dan’. I create a text file ‘nodeslist’ with the IP of each machine and force the number of cores to use on each using the following format:

2/ dan@192.168.0.2
4/ dan@192.168.0.3
4/ dan@192.168.0.4

You can tell parallel to use this file with the ‘–sshloginfile’ flag. As an example we can print out the hostname 4 time:

parallel --sshloginfile nodefile echo "Number {}: Running on \`hostname\`" ::: 1 2 3 4

This will produce an output something like:

Number 1: Running on dan-computer1
Number 4: Running on dan-computer2
Number 3: Running on dan-computer2
Number 2: Running on dan-computer3

Note, the commands won’t necessarily be executed in order.

For a more useful example, we can use gdal_translate to copy an image, keeping only the first band for all files matching ‘*tif’ in the current directory of your local machine. Each file needs to be copied to the remote machine (‘–transfer’) and the output returned (‘–return FILE’). The input and output files are removed after the command has completed (‘–cleanup’).

The total command looks something like:

ls *tif | parallel --sshloginfile nodefile \
     --dry-run \
     --transfer \
     --return {.}_b1.tif \
     --cleanup \
     gdal_translate -of GTiff -b 1 {} {.}_b1.tif

To print the commands, but not run them (to check everything looks OK) the ‘–dry-run’ flag is used. The output should be something like:

gdal_translate -of GTiff -b 1 image1.tif image1_b1.tif
gdal_translate -of GTiff -b 1 image2.tif image2_b1.tif
gdal_translate -of GTiff -b 1 image3.tif image3_b1.tif

The syntax ‘{.}_b1.tif’, takes the name of the input file, removes the extension and appends ‘_b1.tif’ on the end.

Running the command again, without the dry-run flag, will run the commands. The output from GDAL won’t be printed until the command has finished. Once all the commands are complete there will be a ‘*_b1.tif’ copy of every tif in the input directory.

There is an overhead to copying files to a different machines so this is only worthwhile if the commands you want to run are computationally intensive. There isn’t much benefit to using it for ‘gdal_translate’, but it makes a nice simple example to demonstrate the capability.

Further reading

[1] O. Tange (2011): GNU Parallel – The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

Batch processing atmospheric correction using ARCSI

Previous posts have covered atmospheric correction using the Atmospheric and Radiometric Correction of Satellite Imagery (ARCSI) software. When a large number of scenes require processing then commands to automate steps are desirable and greatly simplify the process.

ARCSI provides a number of commands which are useful for batch processing:

  • arcsisortlandsat.py sorts the landsat ‘tar.gz’ files into individual directories for each sensor (i.e., LS 5 TM) and also builds a standard directory structure for processing.
  • arcsiextractdata.py extracts the contents of \verb|tar| and \verb|tar.gz| archives with each archive being extracted into an individual directory.
  • arcsibuildcmdslist.py builds the ‘arcsi.py’ commands for each scene creating a shell script.

The files used for this tutorial are shown below, but others could be downloaded from from earthexplorer and used instead.

LC80720882013174LGN00.tar.gz	LT52040232011310KIS00.tar.gz
LC82040242013139LGN01.tar.gz	LT52040242011118KIS00.tar.gz
LT52040232011118KIS00.tar.gz	LT52050232011109KIS00.tar.gz

Sort Landsat Scenes

The first command to execute is ‘arcsisortlandsat.py’ which needs to be executed from within the directory containing files:

arcsisortlandsat.py -i . -o .

This will generate the following directory structure and move the files into the appropriate directories.

> ls *
LS5:
Inputs	Outputs	RAW	tmp

LS8:
Inputs	Outputs	RAW	tmp

> ls */RAW
LS5/RAW:
LT52040232011118KIS00.tar.gz	LT52040242011118KIS00.tar.gz
LT52040232011310KIS00.tar.gz	LT52050232011109KIS00.tar.gz

LS8/RAW:
LC82040242013139LGN01.tar.gz	LC82060242015047LGN00.tar.gz
LC82060232015047LGN00.tar.gz

Extract Data

To extract the data from the ‘tar.gz’ files the ‘arcsiextractdata.py’ command is used as shown below:

arcsiextractdata.py -i ./LS5/RAW/ -o ./LS5/Inputs/

arcsiextractdata.py -i ./LS8/RAW/ -o ./LS8/Inputs/

Once the files are extracted the directory structure will look like the following:

> ls */RAW
LS5/RAW:
LT52040232011118KIS00.tar.gz	LT52040242011118KIS00.tar.gz
LT52040232011310KIS00.tar.gz	LT52050232011109KIS00.tar.gz

LS8/RAW:
LC80720882013174LGN00.tar.gz	LC82040242013139LGN01.tar.gz

> ls */Inputs
LS5/Inputs:
LT52040232011118KIS00	LT52040242011118KIS00
LT52040232011310KIS00	LT52050232011109KIS00

LS8/Inputs:
LC82040242013139LGN01	LC82060242015047LGN00
LC82060232015047LGN00

Build ARCSI Commands

To build the ‘arcsi.py’ commands, one for each input file, the following commands are used. Notice that these are very similar to the individual commands that you previously executed but now provide inputs to the ‘arcsibuildcmdslist.py’ command which selects a number of input files and generate a single shell script output.

arcsibuildcmdslist.py -s ls5tm -f KEA --stats -p RAD DOSAOTSGL SREF \
          --outpath ./LS5/Outputs --aeroimg ../WorldAerosolParams.kea \
          --atmosimg ../WorldAtmosphereParams.kea --dem ../UKSRTM_90m.kea \
          --tmpath ./LS5/tmp --minaot 0.05 --maxaot 0.6 --simpledos \
          -i ./LS5/Inputs -e MTL.txt -o LS5ARCSICmds.sh

arcsibuildcmdslist.py -s ls8 -f KEA --stats -p RAD DOSAOTSGL SREF \
        --outpath ./LS8/Outputs --aeroimg ../WorldAerosolParams.kea \
        --atmosimg ../WorldAtmosphereParams.kea --dem ../UKSRTM_90m.kea \
        --tmpath ./LS8/tmp --minaot 0.05 --maxaot 0.6 --simpledos \
        -i ./LS8/Inputs -e MTL.txt -o LS8ARCSICmds.sh

Following the execution of these commands the following two files will have been created ‘LS5ARCSICmds.sh’ and ‘LS8ARCSICmds.sh’. These files contain the ‘arcsi.py’ commands to be executed. Open the files and take a look, you will notice that all the file paths have been convert to absolute paths which means the file can be executed from anywhere on the system as long as the input files are not moved.

Execute ARCSI Commands

To execute the ‘arcsi.py’ commands the easiest methods is to run each in turn using the following command:

sh LS5ARCSICmds.sh

sh LS8ARCSICmds.sh

This will run each of the commands sequentially. However, most computers now have multiple processing cores and to take advantage of those cores we can use the GNU parallel command line tool. Taking advantage of those cores means that processing can be completed much quicker and more efficiently.

cat LS5ARCSICmds.sh LS8ARCSICmds.sh | parallel -j 2 

The switch ‘-j 2’ specifies the number of processing cores which can be used for this processing. If no switches are provided then all the cores will be used. As some parts of the process involve reading / writing large files, depending on the speed of you hard drive and the number of cores you have available you might find limiting the number of cores is needed. Please note that until all processing has completed nothing will be printed to the console. You can check the command are running using the top / htop command.

Once you have completed your processing you should clean up your system to remove any files you don’t need for any later processing steps. In most cases you will only need to keep the ‘tar.gz’ (so you can reprocess the RAW data at a later date if required) and the SREF product. It is recommended that you also retain the scripts you used for processing and a record of the commands you used for a) reference if you need to rerun the data and b) as documentation for the datasets so you know what parameters and options were used.

This post was adapted from notes for a module taught as part of the Masters course in GIS and Remote Sensing at Aberystwyth University, you can read more about the program here.

GNU Parallel

GNU Parallel is a utility for executing commands in parallel. It provides a really easy way of  running a command over multiple files and utilising multiple cores, using only a single line.

You can download the latest version from:

http://www.gnu.org/software/parallel/

Or check the package manager for your distro if you’re on linux.

Installation should just be:


./configure
make
sudo make install

There are lots of options and different ways of using parallel, here are a couple of examples:

1. Uncompress and untar all files in a directory.


ls *tar.gz | parallel tar -xf

2. Recursively find RSGISLib XML scripts and run using two cores (-j 2)


find ./ -name 'run_*.xml' | parallel -j 2 rsgisexe -x

3. Find all KEA files and create stats and pyramids using gdalcalcstats from https://bitbucket.org/chchrsc/gdalutils/


ls *kea | gdalcalstats {} -ignore 0

4. Convert KEA (.kea) files to ERDAS imagine format using gdal_translate, removing ‘.kea’ and adding ‘_img.img’


ls *kea | parallel "gdal_translate -of HFA {} {.}_img.img"

# Calculate stats and pyramids after translating
ls *kea | parallel "gdal_translate -of HFA {} {.}_img.img; \

gdalcalcstats {.}_img.img -ignore 0"