Tag Archives: linux

Execute commands on multiple computers using GNU Parallel (setting up a cluster on the cheap)

I’ve mentioned before how awesome GNU Parallel [1] is for easily making use of multiple cores on a single machine. You can also use it to run commands on multiple machines if you have SSH access to them and have set up SSH keys for password-less login (there is a guide to setting up SSH keys at https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys--2).

For this example, we’ll assume I have set up SSH keys for three computers under my username ‘dan’. I create a text file ‘nodeslist’ with the IP of each machine and force the number of cores to use on each using the following format:

2/dan@192.168.0.2
4/dan@192.168.0.3
4/dan@192.168.0.4

You can tell parallel to use this file with the ‘--sshloginfile’ flag. As an example, we can print out the hostname four times:

parallel --sshloginfile nodeslist echo "Number {}: Running on \`hostname\`" ::: 1 2 3 4

This will produce an output something like:

Number 1: Running on dan-computer1
Number 4: Running on dan-computer2
Number 3: Running on dan-computer2
Number 2: Running on dan-computer3

Note, the commands won’t necessarily be executed in order; pass the ‘--keep-order’ (‘-k’) flag if you need the output kept in the same order as the input.

For a more useful example, we can use gdal_translate to copy an image, keeping only the first band, for all files matching ‘*tif’ in the current directory of your local machine. Each file needs to be copied to the remote machine (‘--transfer’) and the output file returned (‘--return’). The input and output files are removed after the command has completed (‘--cleanup’).

The total command looks something like:

ls *tif | parallel --sshloginfile nodeslist \
     --dry-run \
     --transfer \
     --return {.}_b1.tif \
     --cleanup \
     gdal_translate -of GTiff -b 1 {} {.}_b1.tif

To print the commands, but not run them (to check everything looks OK), the ‘--dry-run’ flag is used. The output should be something like:

gdal_translate -of GTiff -b 1 image1.tif image1_b1.tif
gdal_translate -of GTiff -b 1 image2.tif image2_b1.tif
gdal_translate -of GTiff -b 1 image3.tif image3_b1.tif

The syntax ‘{.}_b1.tif’ takes the name of the input file, removes the extension and appends ‘_b1.tif’ to the end.
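For comparison, the same renaming can be done in plain bash with parameter expansion (‘image1.tif’ here is just an example filename):

```shell
# Strip the last extension and append '_b1.tif',
# mirroring what parallel's {.} replacement string does
f="image1.tif"
echo "${f%.*}_b1.tif"    # prints image1_b1.tif
```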

Running the command again without the ‘--dry-run’ flag will execute the commands. The output from GDAL won’t be printed until each command has finished. Once all the commands are complete there will be a ‘*_b1.tif’ copy of every tif in the input directory.

There is an overhead to copying files to a different machine, so this is only worthwhile if the commands you want to run are computationally intensive. There isn’t much benefit for ‘gdal_translate’, but it makes a nice, simple example to demonstrate the capability.

Further reading

[1] O. Tange (2011): GNU Parallel – The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

Installing RSGISLib on Windows through a Virtual Machine

Although some efforts have been made to create Windows versions of the software in the stack we use, the main platforms we’re using and developing on are Linux and OS X. Therefore, the advice we normally give people wanting to use RSGISLib or SPDLib on Windows is to install Linux in a virtual machine and use this.

Below are the steps needed to get everything working on a virtual machine. All these steps assume a 64-bit version of Windows (as you’ll need to install a 64-bit version of Linux). There is lots of information on the internet providing more detail on the specifics of setting up a virtual machine and installing Linux, so I’ve just provided an overview of these.

  1. Download and install VirtualBox

    Download the free VirtualBox from https://www.virtualbox.org/ and install it.

  2. Download a Linux installation image

    There are a number of Linux distributions available. The recommended distribution is Ubuntu, which is available to download from http://www.ubuntu.com/download. If your computer has limited RAM (less than 4 GB), you might want to consider a more lightweight version such as Lubuntu.

    Whichever distribution you choose, make sure you download a 64-bit desktop CD / DVD image. Note: you don’t actually have to burn the image to a CD, the iso file is fine.

  3. Set up a new virtual machine

    The VirtualBox documentation has a good step-by-step guide to setting up a virtual machine (http://www.virtualbox.org/manual/ch01.html) and there is lots of documentation available on the internet. Important considerations are to allocate enough space for the virtual machine hard drive; if you select a dynamically allocated hard drive it won’t use the full capacity at once but will be able to grow as you need it. Also give the virtual machine as much RAM as you can spare.

  4. Install Linux

    When you first start the virtual machine you will be prompted to install an operating system; select the Linux .iso file you have downloaded and it should boot off this.

    The installation procedure will depend on the Linux distribution you have. Instructions for Ubuntu are here. At some point in the installation it might warn you about erasing all data on the disk; don’t worry about this, it’s only the virtual disk you have just created.

  5. Install guest additions

    Once Linux has been installed you need to install the ‘Guest Additions’. To do this select Devices, ‘Install Guest Additions CD Image’ from the VirtualBox menu, then follow the prompts within Linux.

  6. Set up for copying data

    The easiest way to copy data to and from the machine is to enable drag and drop. Once the machine is turned on go to Devices, Drag and Drop, and select ‘Bidirectional’. This will allow you to drag and drop files to the virtual machine. You can also set up a shared folder; there are some details on this here.

  7. Install the software stack

    As detailed in earlier posts, we make Linux binaries of the software we develop available through conda. To get everything installed, follow the steps below within the virtual machine:

    • Install conda

      Download the latest version from here, grabbing the 64-bit Python 3.5 installer. Once the file has downloaded, open a terminal window (in Ubuntu, click on the icon in the top left, start typing ‘Terminal’ and click on the application which pops up).

      Within a terminal window type the following:

      cd ~/Downloads
      bash Miniconda3-*-Linux-x86_64.sh
      
    • Install software from binstar

      Open a new terminal window and type the following:

      # Install anaconda client
      conda install anaconda-client
      
      # Install software to a separate environment
      conda env create au-eoed/au-osgeo-lnx
      
      # Activate environment
      source activate au-osgeo-lnx
      
      # Use environment for all new terminals (optional)
      echo source activate au-osgeo-lnx >> ~/.bashrc
      
    • Test

      Open a new terminal window; you should be able to type:

      tuiview
      

      to open TuiView.

As you’ll be interacting with the software from the command line, you will need to learn some commands. The Software Carpentry shell course (http://swcarpentry.github.io/shell-novice/) is a good starting point. The Linux Command Line book is also recommended.

UNIX Commands I wish I’d known earlier

I’ve been using command line UNIX / Linux for a while but like many people have just picked up bits as and when I’ve needed them. Here are some tips I wish I’d known when I started out.

  1. Use tab to autocomplete.

    This is really basic but not mentioned in some tutorials. As you start typing a command or file path, just hit tab to complete it. You only need to type enough letters for the command / path to be identified.

  2. Use Ctrl+r to search previous commands.

    You can use the up arrow to cycle through previous commands; a slightly cooler trick is typing Ctrl+r to search through your command history. Just press it and start typing.

  3. Learn to use an editor from the command line (it doesn’t have to be vi!).

    If you’re logged into a server using SSH, being able to quickly open a text file and edit a few lines is very handy, and a lot easier than downloading the file, editing it and uploading it again. A lot of guides use vi, which is amazingly powerful and pretty much universally available, but has a really steep learning curve. I love vi (or more accurately vim, which is an updated version) but it took a lot of effort to get to a stage where I was proficient enough for it to be useful. There are much more user-friendly alternatives such as nano or ne. Nano is installed with OS X and most Linux distributions; ne will need to be installed, but has more features and familiar commands (Ctrl+s to save etc.), and you can double-tap Escape to bring up a menu bar.

  4. Use tmux, screen or byobu to keep sessions running when you log out.

    If you’re logged into a machine over SSH and running an interactive process, it will often stop when you close the SSH connection. Using GNU screen, tmux or byobu will allow you to detach your session so the process continues in the background; you can reattach later to check on progress. These also allow you to have multiple terminals within the same SSH session.

  5. Think before you press enter.

    An obvious one, but on a UNIX system with the right commands you can do pretty much anything, which means you can also do some ridiculously stupid things, especially with the sudo command. Poor use of rm with wildcards has caught me out before (luckily by this time I’d developed a somewhat paranoid backup system!). You can always run ls with the same pattern before running rm to check which files will be removed.

    If you’re worried about breaking something on your computer while learning the command line, you could set up a virtual machine (e.g., using VirtualBox) with Linux; then if anything goes badly wrong you can just delete the machine and start again.

  6. Googling bits as you need them is no substitute for actually sitting down and learning it.

    If it looks like you’re going to be spending a lot of time using the command line of a UNIX / Linux system (and it’s a very useful skill to have), then as with anything worth learning you need to invest the time. There are lots of tutorials on the internet and books available, and you may find some more suitable than others. My personal favourite is The Linux Command Line by William Shotts; you can download the PDF for free or buy the hard copy, more information is available here.
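The ls-before-rm habit from the list above can be practised safely in a throwaway directory (the filenames below are just examples):

```shell
# Work in a scratch directory
cd "$(mktemp -d)"
touch report_1.txt report_2.txt notes.txt

# Preview what the wildcard matches before deleting anything
ls report_*.txt

# The same pattern is now safe to pass to rm
rm report_*.txt
ls    # only notes.txt remains
```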

Managing Software & Libraries with EnvMaster

If you’ve compiled software from source you’re probably used to the following sequence:

./configure
make
sudo make install

This will configure the software to look for libraries and install itself to the default location (normally /usr/local), then build and install it. As root privileges are required to install to /usr/local, sudo is needed for the final step.

This is fine unless:

  1. You’re not in the sudoers list (e.g., on a shared computer you’re not an administrator of).
  2. You want to have different versions of things (e.g., stable and development versions).

then things start to get complicated. Installing all the software to a single folder you have permission to write to, e.g., ~/software, by passing the --prefix=~/software flag to configure, then adding ~/software/bin to your $PATH and ~/software/lib to $LD_LIBRARY_PATH, would solve the first problem but will still cause problems if you want different versions of things.
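That single-prefix approach looks something like the following (a sketch only; ~/software is an example location, and you would run this from an unpacked source directory):

```shell
# Configure the build to install under a directory you own - no sudo needed
./configure --prefix=$HOME/software
make
make install

# Make the installed binaries and libraries findable
# (add these two lines to ~/.bashrc to make them permanent)
export PATH=$HOME/software/bin:$PATH
export LD_LIBRARY_PATH=$HOME/software/lib:$LD_LIBRARY_PATH
```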

Ideally you’d install everything into its own directory, something like:

~/software/
          rsgislib/
                  2.0.0
                  20131019
          gdal/
              1.10.1
              1.9.0

which means you’re going to be spending a lot of time hacking around with environment variables!

This is where EnvMaster comes in, it allows you to install different versions of software and libraries, in a nicely organised directory structure, and sorts out all the paths for you. When it’s properly set up you can load and swap software / libraries round using:

# Load in RSGISLib
envmaster load rsgislib

# Swap to use the development version of RSGISLib
envmaster swap rsgislib/2.0.0 rsgislib/20131019

To install EnvMaster clone the source from the EnvMaster Bitbucket page and install:

hg clone https://bitbucket.org/chchrsc/envmaster envmaster

export ENVMASTER_ROOT=~/software/envmaster
python setup.py install --prefix=$ENVMASTER_ROOT

Note: as with the rest of the examples, I’m installing to ~/software (where ~ is your home directory). You can use anywhere you have permission to write; just do a mental find and replace, wherever you see ~/software, with the path you’re using. We normally install things to /share/osgeo/fw/

EnvMaster uses a corresponding text file for each library; these need to be stored in a separate directory (ideally on their own). We’ll make one called modules.

mkdir -p ~/software/modules

Once this is set up, create a text file (let’s call it ~/software/setupenvmaster) containing the following:

export PATH=$ENVMASTER_ROOT/bin:$PATH
export PYTHONPATH=$ENVMASTER_ROOT/lib/python2.7/site-packages
# Set up path to modules files
export ENVMASTERPATH=~/software/modules
# Initialize EnvMaster
. $ENVMASTER_ROOT/init/bash

And source it using:

. ~/software/setupenvmaster

If you add this line to .bashrc / .bash_profile, EnvMaster will be available in every new terminal.

Note, if you used a different version of Python to install EnvMaster (e.g., python3.3), you need to change the ‘python2.7’ part of the PYTHONPATH line above to match.

Now if you type

envmaster avail

you should see ‘~/software/modules’ and ‘No module files found’.

EnvMaster is now all set up and it’s time to start installing things.

Let’s install GDAL. First download it from here and untar, then:

./configure --prefix=/home/dan/software/gdal/1.10.1
make
make install

Note, you probably want to install GDAL with other options (e.g., HDF5), see the RSGISLib documentation for the options recommended if building for RSGISLib.

You then need to set up the files for EnvMaster. Within ~/software/modules make a directory called gdal and create two text files, one called 1.10.1 (the version of gdal) and the other called version.py. In 1.10.1 put the following:

#%EnvMaster1.0

module.setAll('/home/dan/software/gdal/1.10.1')

The first line tells EnvMaster this is an EnvMaster file; module.setAll() sets environment variables (PATH etc.) based on the contents of ~/software/gdal/1.10.1.
In version.py put the following:

#%EnvMaster1.0

version = '1.10.1'

This tells EnvMaster the default version of GDAL is 1.10.1.

If you run envmaster avail again it should list gdal. You can load GDAL using:

envmaster load gdal

To unload GDAL (and unset paths) use:

envmaster unload gdal

You can try running gdal_translate to check.
To see the environment variables the GDAL module sets, use:

envmaster disp gdal

As well as adding paths to general variables (e.g., PATH), EnvMaster creates variables specific to GDAL (e.g., GDAL_LIB_PATH); these are useful when linking from other software.

To see the envmaster modules you have loaded you can use:

envmaster list

There are loads more options available in EnvMaster than shown here. Whilst it does take a bit of time to set up, it allows you to build a highly organised and very flexible system. The user guide for EnvMaster, in LyX format, is available with the source code from here.

A comprehensive set of instructions for building software with EnvMaster under Linux is available from Landcare (here). Note: these were developed for their system and have very generously been made available; use at your own risk.
Pete Bunting’s instructions for building under OS X are also available here; these have been tested under OS X 10.9.

View the first lines in a text file

A really simple, but useful, UNIX command is head, which displays the first 10 lines of a text file by default, great for quickly checking outputs. You can also use the ‘-n’ flag to specify the number of lines.

# View first 10 lines (the default)
head bigtextfile.csv

# View first 5 lines
head -n 5 bigtextfile.csv

There is also a corresponding tail command to view the last lines.

# View last 10 lines (the default)
tail bigtextfile.csv

# View last 5 lines
tail -n 5 bigtextfile.csv
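head and tail can also be combined to pull a range of lines out of the middle of a file; for example, lines 6 to 10 (using a generated sample file):

```shell
# Make a 20-line sample file
seq 1 20 > sample.txt

# First 10 lines, then the last 5 of those: lines 6-10
head -n 10 sample.txt | tail -n 5
```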

GNU Parallel

GNU Parallel is a utility for executing commands in parallel. It provides a really easy way of running a command over multiple files and utilising multiple cores, using only a single line.

You can download the latest version from:

http://www.gnu.org/software/parallel/

Or check the package manager for your distro if you’re on Linux.

Installation should just be:


./configure
make
sudo make install

There are lots of options and different ways of using parallel, here are a couple of examples:

1. Uncompress and untar all files in a directory.


ls *tar.gz | parallel tar -xf

2. Recursively find RSGISLib XML scripts and run using two cores (-j 2)


find ./ -name 'run_*.xml' | parallel -j 2 rsgisexe -x

3. Find all KEA files and create stats and pyramids using gdalcalcstats from https://bitbucket.org/chchrsc/gdalutils/


ls *kea | parallel gdalcalcstats {} -ignore 0

4. Convert KEA (.kea) files to ERDAS Imagine format using gdal_translate, removing ‘.kea’ and adding ‘_img.img’


ls *kea | parallel "gdal_translate -of HFA {} {.}_img.img"

# Calculate stats and pyramids after translating
ls *kea | parallel "gdal_translate -of HFA {} {.}_img.img; \
gdalcalcstats {.}_img.img -ignore 0"

Backup with rsync

There are a lot of great tools out there for backups. However, for large remote sensing datasets not all are appropriate. I use the command line tool rsync to back up my data to external hard drives. It only copies data that has changed, making it efficient to run.

The command I use is:


rsync -r -u  -p -t --delete --force --progress \

/data/Australia/ /media/Backup1/Australia/

This will copy recursively (-r), updating only files that have changed (-u), preserving permissions (-p) and time stamps (-t). Files on the backup drive that no longer exist in the source are removed (--delete), including directories (--force). As it can take a while to run, I print the progress to the screen (--progress).

My system is to back up the data on my office computer regularly (weekly, or after getting / processing new data) to external hard drives, which I keep at home. To save having to remember the command I have a shell script (backup.sh) in the root directory of the external hard drives. This system works well for me, as my lab has a NAS drive all the data is stored on, and all my scripts are stored separately from the data and backed up a lot more regularly.

If you leave your external hard drive connected to your computer you can create a cron job to run your backup script at regular intervals. To create a cron job use:


crontab -e

This will open a text file for editing, with comments (lines beginning with #) explaining the format. To back up at 4 pm every day, add the following line.


0 16 * * * sh /media/Backup1/backup.sh

Remember to change the path of the backup script. You can change the time (the second number is the hour) as needed. You can duplicate the line and set a different time to run twice a day (e.g., in the morning and afternoon).

There are many options for rsync so you can customise the backup to suit your requirements. To see these options type:


rsync --help

As rsync only updates files that have changed and preserves the time stamp, I also use it if I have a folder with lots of large files to copy. Then, if the copy gets interrupted, it can be resumed.