Tag Archives: Parallel Processing

Execute commands on multiple computers using GNU Parallel (setting up a cluster on the cheap)

I’ve mentioned before how awesome GNU Parallel[1] is for easily making use of multiple cores on a single machine. You can also use it to run commands on multiple machines if you have SSH access to them, and have set up SSH keys for password-less login (there is a guide to setting up SSH keys here https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys–2

For this example, we’ll assume I have set up SSH keys for three computers under my username ‘dan’. I create a text file ‘nodeslist’ with the IP of each machine and force the number of cores to use on each using the following format:

2/ dan@192.168.0.2
4/ dan@192.168.0.3
4/ dan@192.168.0.4

You can tell parallel to use this file with the ‘–sshloginfile’ flag. As an example we can print out the hostname 4 time:

parallel --sshloginfile nodefile echo "Number {}: Running on \`hostname\`" ::: 1 2 3 4

This will produce an output something like:

Number 1: Running on dan-computer1
Number 4: Running on dan-computer2
Number 3: Running on dan-computer2
Number 2: Running on dan-computer3

Note, the commands won’t necessarily be executed in order.

For a more useful example, we can use gdal_translate to copy an image, keeping only the first band for all files matching ‘*tif’ in the current directory of your local machine. Each file needs to be copied to the remote machine (‘–transfer’) and the output returned (‘–return FILE’). The input and output files are removed after the command has completed (‘–cleanup’).

The total command looks something like:

ls *tif | parallel --sshloginfile nodefile \
     --dry-run \
     --transfer \
     --return {.}_b1.tif \
     --cleanup \
     gdal_translate -of GTiff -b 1 {} {.}_b1.tif

To print the commands, but not run them (to check everything looks OK) the ‘–dry-run’ flag is used. The output should be something like:

gdal_translate -of GTiff -b 1 image1.tif image1_b1.tif
gdal_translate -of GTiff -b 1 image2.tif image2_b1.tif
gdal_translate -of GTiff -b 1 image3.tif image3_b1.tif

The syntax ‘{.}_b1.tif’, takes the name of the input file, removes the extension and appends ‘_b1.tif’ on the end.

Running the command again, without the dry-run flag, will run the commands. The output from GDAL won’t be printed until the command has finished. Once all the commands are complete there will be a ‘*_b1.tif’ copy of every tif in the input directory.

There is an overhead to copying files to a different machines so this is only worthwhile if the commands you want to run are computationally intensive. There isn’t much benefit to using it for ‘gdal_translate’, but it makes a nice simple example to demonstrate the capability.

Further reading

[1] O. Tange (2011): GNU Parallel – The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

Multi-core processing with RSGISLib using tiles

Within RSGISLib there are functions within imageutils for tiling and mosaicking images. These can be combined to split large datasets up for processing using multiple cores or nodes on a HPC. The Python bindings for the createTiles function was written so it returns a list of all tiles created, this allows the list of tiles to be passed to a separate function for processing. Combining with the multiprocessing module in Python provides a simple way of processing large datasets on multiple cores with the createImageMosaic function called at the end to join the tiles back together.

The example below shows how to:

  1. Split an image into temporary tiles (in a directory created using the tempfile module).
  2. Run an RSGISLib function on each tile, in this case imageMath.
  3. Re-mosaic the tiles.
  4. Remove the temp files created.
# Import RSGISLib
import rsgislib
from rsgislib import imageutils
from rsgislib import imagecalc

# Import multiprocessing
from multiprocessing import Pool
# Import tempfile
import tempfile
# Import os
import os

outFormat = 'KEA'
outType = rsgislib.TYPE_32INT

def addOne(inImage):

    outputImage = inImage.replace('.kea','add1.kea')
    expression = 'b1+1'
    imagecalc.imageMath(inImage, outputImage,
                    expression, outFormat, outType)

    return outputImage

if __name__ == '__main__':

    inputImage = 'N06W053_PALSAR_08_HH_utm.kea'
    outImage = 'N06W053_PALSAR_08_HH_utm_addOne.kea'

    # Create temporary directory
    tempDIR = tempfile.mkdtemp(dir='.')
    outBase = os.path.join(tempDIR, 'tile')

    # Create Tiles
    width = 1000
    height = width
    overlap = 5 # Set a 5 pixel overlap between tiles
    offsettiling = 0
    ext='kea'
    temptiles = imageutils.createTiles(inputImage, outBase, width,
             height, overlap, offsettiling, outFormat, outType, ext)

    # Run process on tiles
    pool = Pool()
    temptilesP = pool.map(addOne, temptiles)

    # Mosaic tiles
    backgroundVal = 0.
    skipVal = 0.
    skipBand = 1
    overlapBehaviour = 0

    imageutils.createImageMosaic(temptilesP, outImage, backgroundVal,
               skipVal, skipBand, overlapBehaviour, outFormat, outType)

    # Remove temp tiles and DIR
    removeList = temptiles + temptilesP
    for tile in removeList:
        os.remove(tile)

    os.removedirs(tempDIR)

As all the functions try to write to stdout at the same time it looks a little messy (I still need to figure out a nice way to improve this). Also, although I didn’t run any tests the overhead of creating and mosaicking the tiles will likely make this image maths function (which is not particularly CPU intensive) take longer than running on the entire image in one go. However, it shows the potential for combining the RSGISLib Python bindings with the multiprocessing module.

GNU Parallel

GNU Parallel is a utility for executing commands in parallel. It provides a really easy way of  running a command over multiple files and utilising multiple cores, using only a single line.

You can download the latest version from:

http://www.gnu.org/software/parallel/

Or check the package manager for your distro if you’re on linux.

Installation should just be:


./configure
make
sudo make install

There are lots of options and different ways of using parallel, here are a couple of examples:

1. Uncompress and untar all files in a directory.


ls *tar.gz | parallel tar -xf

2. Recursively find RSGISLib XML scripts and run using two cores (-j 2)


find ./ -name 'run_*.xml' | parallel -j 2 rsgisexe -x

3. Find all KEA files and create stats and pyramids using gdalcalcstats from https://bitbucket.org/chchrsc/gdalutils/


ls *kea | gdalcalstats {} -ignore 0

4. Convert KEA (.kea) files to ERDAS imagine format using gdal_translate, removing ‘.kea’ and adding ‘_img.img’


ls *kea | parallel "gdal_translate -of HFA {} {.}_img.img"

# Calculate stats and pyramids after translating
ls *kea | parallel "gdal_translate -of HFA {} {.}_img.img; \

gdalcalcstats {.}_img.img -ignore 0"