Tag Archives: Python

Converting a Jupyter Notebook to a Python Script

Jupyter notebooks are great for developing a method but what about if you need to run the method for lots of files? This post describe how to go from a notebook to a script which you can run on the command line and apply to multiple files.

For this example we’ll be using a notebook written to calculate the Floating Debris Index (FDI; [1]) and apply to a Sentinel 2 netCDF file processed using ACOLITE. If you are interested in running the notebook a tutorial on using ACOLITE is available from a NEODAAS training course. You can download ACOLITE and the manual from here. The xarray library is used to read the netCDF file, calculate FDI and write back out to a netCDF file.

[1] Biermann, L., Clewley, D., Martinez-Vicente, V. et al. Finding Plastic Patches in Coastal Waters using Optical Satellite Data. Sci Rep 10, 5364 (2020). https://doi.org/10.1038/s41598-020-62298-z

1. Tidy up notebook

The first step is to tidy up the notebook so it only contains the code that is required. This means removing cells which print out lots of data or produce plots (unless plots are an output of the notebook).

2. Export as Python

From File -> Export Notebook As… in the Jupyter Lab menu select ‘Export Notebook to Executable Script’. This will give you a Python script to download.

If you copy the script to the same directory as the notebook and either open a terminal / command prompt and navigate to this directory or select ‘File’ -> ‘New’ -> Terminal in Jupter Lab to open a terminal within your browser you can run the script using python SCRIPT_NAME so for this example:

python s2_FDI_Calculation.py

This allows running the script outside a notebook. However, it will only run the netCDF file which has been ‘hard coded’ in the script, we want to be able to specify the input file on the command line.

3. Allow specifying input file using argparse

The input and output file are specified in the lines:

s2_data = xarray.open_dataset("S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R.nc", decode_cf=False, chunks=512)


fdi_xarray.to_netcdf("S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R_fdi.nc", engine="netcdf4", encoding={"fdi" : {"zlib" : True, "complevel" : 4}})

Using argparse it is possible to specify file names (any many other parameters) from the command line.

For example:

import argparse
parser = argparse.ArgumentParser(description="Calculate Floating Debris Index (FDI) Biermann "
                                                " to a S2 scene")
parser.add_argument("innetcdf", help="Input netCDF file")
parser.add_argument("outnetcdf", help="Output netCDF file")
args = parser.parse_args()

Adding this at the start of the script and changing the lines

s2_data = xarray.open_dataset(args.innetcdf, decode_cf=False, chunks=512)


fdi_xarray.to_netcdf(args.outnetcdf engine="netcdf4", encoding={"fdi" : {"zlib" : True, "complevel" : 4}})

Trying to run the script as before using python s2_FDI_Calculation.py will now print the following message:

usage: s2_FDI_Calculation.py [-h] innetcdf outnetcdf
s2_FDI_Calculation.py: error: the following arguments are required: innetcdf, outnetcdf

So argparse not only allows taking arguments from the command line it also provides help if the correct arguments aren’t provided. The script can now be run using:

python s2_FDI_Calculation.py S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R.nc S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R_fdi.nc

Providing a different input and output file will allow running for different files. However, an improvement would be allow the script to take multiple input files and run for all of them rather than having to run the script for each file.

4. Tidy up script to use functions

To make it easier to run multiple files the code which calculates FDI will be split into a separate function:

def calculate_fdi_from_netcdf(input_netcdf, output_netcdf):
    Function to calculate FDI from a netCDF file
    # Specify chunksize so uses dask and doesn't load all data to RAM
    s2_data = xarray.open_dataset(input_netcdf, decode_cf=False, chunks=512)
    # Export
    print("Writing out")
    with ProgressBar():
        fdi_xarray.to_netcdf(output_netcdf, engine="netcdf4", encoding={"fdi" : {"zlib" : True, "complevel" : 4}})

The argparse lines are them moved to the bottom of the script:

if __name__ == "__main__":

    parser = argparse.ArgumentParser(description="Calculate Floating Debris Index (FDI) Biermann "
                                                    " to a S2 scene")
    parser.add_argument("innetcdf", help="Input netCDF file")
    parser.add_argument("outnetcdf", help="Output netCDF file")
    args = parser.parse_args()

    calculate_fdi_from_netcdf(args.innetcdf, args.outnetcdf)

This will only run if the the file is run as a script, allowing the function to be used in another Python script. For more details see here.

As part of the tidying up process comments with cell numbers from the notebook (e.g., # In[3]:) can be removed.

This will run the same as before, so a single input and output file is provided. The next step is to take a number of input files, an output directory and run for all of them:

import os

if __name__ == "__main__":

    parser = argparse.ArgumentParser(description="Calculate Floating Debris Index (FDI) Biermann "
                                                    " to a S2 scene")
    parser.add_argument("innetcdfs", nargs="+", help="Input netCDF file (s)")
    parser.add_argument("-o", "--outdir", help="Output directory", required=True)
    args = parser.parse_args()

    for in_netcdf in args.innetcdfs:
        print(f"Calculating FDI for {in_netcdf}...")
        # Set up output name for FDI netCDF
        out_netcdf_basename = os.path.basename(in_netcdf).replace(".nc", "_fdi.nc")
        out_netcdf = os.path.join(args.outdir, out_netcdf_basename)
        # Calculate FDI
        calculate_fdi_from_netcdf(in_netcdf, out_netcdf)
        print(f"Saved to {out_netcdf}")

This time rather than specifying an output netCDF an output directory is used (specified with -o or --outdir when running the script) and the os module is used to build a path using the input file.

The script can now be run using:

python s2_FDI_Calculation.py S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R.nc S2A_MSI_2018_06_07_08_56_01_T35SMD_L2R.nc -o .

To calculate from two netCDF files and save to the current directory (specified with .). On macOS and Linux can use globs to match a list of files rather than specifying, for example:

python s2_FDI_Calculation.py S2A_MSI_*_L2R.nc -o .

The original notebook and finished Python script are available from: https://github.com/NEODAAS/example_code/tree/main/jupyter-notebook-to-script

A script to find and download GEDI passes

The Global Ecosystem Dynamics Investigation (GEDI) is a spaceborne lidar instrument mounted on the International Space Station (ISS). The GEDI instrument is a geodetic-class lidar with 3 lasers that produce 8 parallel tracks. Each laser illuminates a 25 m footprint on the ground and fires at a rate of 242 times per second. Each footprint is separated by 60 m in the along track direction, at an across-track distance of 600 m between each of the 8 tracks. GEDI’s precise measurements of forest canopy height, canopy vertical structure, and surface elevation will be critical to characterizing and understanding carbon and water cycling processes to further our knowledge of the world we live in: http://www.gedi.umd.edu.

GEDI is mounted on the ISS and therefore its orbit is dictated by that of the ISS. This prevents repeat acquisitions in the same way that Landsat or Sentinel satellite data is collected. While this enables wider coverage, its data acquisition is not consistent. Further to this, GEDI has the ability to point its lasers so that the area on the ground imaged is not necessarily that at nadir beneath the sensor.

To date, GEDI data is available via two means:

Earthdata (https://search.earthdata.nasa.gov/search) provides the data but does not currently provide sufficient visualization so that users can see whether the data intersects their ROI.

Alternatively, the NASA LP DAAC provide the data in list/ftp format (https://e4ftl01.cr.usgs.gov/GEDI/). This allows easy access to all the data but requires that the user knows the orbit that they require. Since February 2020, this has been supplemented by a web tool called GEDI Finder (https://lpdaacsvc.cr.usgs.gov/services/gedifinder). Users pass in bounding box dimensions alongside the GEDI product and version they require and it returns a subset of the list of data that intersects their bounding box. While this is a simple method for accessing data, it requires that a user either manually downloads each one or passes each name to software (wget, curl) to pull the data.

SearchPullGEDI.py was written to make the best of these existing tools and enable the automation of the whole search and download process. It relies heavily on the GEDI Finder tool to search for the data that intersects a bounding box and is, in its simplest form, a wrapper for this tool. It also allows an optional date range to be specified if users are only interested in data collected during a specific time period. SearchPullGEDI.py requires a number of command line options that include GEDI product, version, bounding box, output path and Earthdata login credentials. An overview and example of SearchPullGEDI.py is:

python SearchPullGedi.py -p GEDI02_B -v 001 \
-bb 0.3714633 9.277247 -0.08164874 10.00922 \
-d 2019-01-01 2019-04-30 \
-o /Users/Me/Data/Gedi \
-u MyEarthDataUsername -pw MyEarthDataPassword


-p is the GEDI product (e.g., GEDI02_B)

-v is the GEDI version (e.g., 001)

-bb is the bounding box given in UpperLeftLon (maxY), UpperLeftLat (minx), LowerRightLon (minY) and LowerRightLat (maxX). These should be passed to the terminal separated by spaces.

-d is the date range specified in the format YYYY-MM-DD with the start and end data separated with a space

-o is the local path where you want to download the data

-u is your EarthData login username (Note an EarthData account is required)

-pw is your EarthData login password


python SearchPullData.py -h

will also provide this information.

SearchPullGedi.py contains 3 functions.

  • The first constructs the command line options and passes it to the existing GEDI Finder tool and pulls a list of the GEDI files that are on the html webpage.
  • The second function parses the search results pulled from the GEDI Finder URL and constructs a list where each element is a separate GEDI H5 file.
  • The final function iterates over the list of H5 files and pulls each one using wget software.

SearchPullGEDI.py is available here: https://bitbucket.org/nathanmthomas/bucket-of-rs-and-gis-scripts/src/master/SearchPullGEDI.py

SearchPullGEDI.py requires wget is installed.  On Linux this can be installed through the package manager if not already installed, on macOS you can install this from conda-forge using:

conda create -n gedi -c conda-forge python wget

The command will print the number of files found then start downloading them. Each file is approximatly 1 GB.

Author: Nathan Thomas (@DrNASApants)

Nathan is a UMD Earth System Science Interdisciplinary Center (ESSIC) PostDoc positioned at the NASA Goddard Space Flight Center. Nathan’s research is focused primarily around land cover mapping and characterizing the above ground structure of vegetation, particularly mangrove forests. Through this he uses python to pull, preprocess, calibrate, analyze and display remote sensing info. Some of this code is distributed through his bitbucket: https://bitbucket.org/nathanmthomas/bucket-of-rs-and-gis-scripts/src/master/

Find scenes from an archive which contain a point

This is a quick way of locating which scenes, from an archive of data, contain a point you are interested in.

First make a list of all available scenes

ls data/S1/2017/??/??/*/*vv*img > all_scenes.txt

Then use gdalbuildvrt to make a VRT file containing all scenes as a separate band (assumes scenes only have a single band).

gdalbuildvrt -input_file_list all_scenes.txt \
             -separate -o s1_all_scenes.vrt

Use gdallocation info with ‘-lifonly’ flag to the scene the point we are interested in (1.5 N, 103 E) is within and redirect to a text file.

gdallocationinfo -lifonly -wgs84 s1_all_scenes.vrt \
                 103 1.5 > selected_scenes.txt

This will show if within the bounding box of the scene. To find scenes which have data can use gdallocationinfo again but with the ‘-valonly’ flag and save the values to a text file.

gdallocationinfo -valonly -wgs84 s1_all_scenes.vrt \
                  103 1.5 > selected_scenes_values.txt

Can then subset the original list of files with the values file using Python:

import numpy
# Read in data to numpy arrays
scenes = numpy.genfromtxt("selected_scenes.txt", dtype=str)
values = numpy.genfromtxt("selected_scenes_values.txt")

# Select only scenes where the value is not 0
scenes_data = scenes[values != 0]

# Save to text file
numpy.savetxt("selected_scenes_data.txt", scenes_data, fmt="%s")

If you have a very large archive of data and need to find which scenes intersect a point often, having a spatial database with the scene outlines would be a better approach. However, if it isn’t something you do often this quick approach using only the GDAL utilities and a bit of Python is worth knowing.

Convert Sentinel-2 Data Using GDAL

The latest version of GDAL (2.1) has a driver to read Sentinel 2 data (see http://gdal.org/frmt_sentinel2.html). Like HDF files they are read as subdatasets. Running ‘gdalinfo’ on the zipped folder or the .xml file contained within the .SAFE directory will display all the subdatasets, as well as all the metadata (so quite a lot of information!).

As with HDF5 you can pass the subdataset names to gdalinfo to get more information or gdal_translate to extract them as a separate dataset.

To make it easier to extract all the subdatasets I wrote a script (extract_s2_data.py) which can be downloaded from the https://bitbucket.org/petebunting/rsgis_scripts repository.

The script will get a list of all subdatasets using GDAL:

from osgeo import gdal
dataset = gdal.Open('S2/S2.xml', gdal.GA_ReadOnly)
subdatasets = dataset.GetSubDatasets()
dataset = None

The ones to be extracted are for the 10, 20 and 60 m resolution band groups for each UTM zone (if the file crosses multiple zones).

For each subdataset it will give an output name, replacing the EPSG code with the UTM zone and ‘:’ with ‘_’.

Then the gdal_translate command is used to create a new file for each. By default the output is KEA format, called using subprocess.

To run the script first install GDAL 2.1, the conda-forge channel has recent builds, to install them using conda:

conda create -n gdal2 -c conda-forge gdal
source activate gdal2

(If you are on Windows leave out ‘source’)

To extract all subdatasets from a zipped Sentinel 2 scene to the current directory you can then use:

extract_s2_data.py -o . \

The gdal_translate command used is printed to the screen.

The default output format is KEA, you can change using the ‘–of’ flag. For example to convert an unzipped scene to GeoTiff:

extract_s2_data.py -o . --of GTiff \

To get the extension for all supported drivers, and some creation options the ‘get_gdal_drivers’ module from arsf_dem_scripts is optionally used. You can just download this file and copy into the same directory ‘extract_s2_data.py’ has been saved to. For Linux or OS X you can run:

# OS X
curl https://raw.githubusercontent.com/pmlrsg/arsf_dem_scripts/master/arsf_dem/get_gdal_drivers.py > get_gdal_drivers.py

# Linux
wget https://raw.githubusercontent.com/pmlrsg/arsf_dem_scripts/master/arsf_dem/get_gdal_drivers.py

Create a CSV with the coordinates from geotagged photos

Recently I took a lot of photos on my phone during fieldwork (survey for the NERC-ARF LiDAR and hyperspectral calibration flights) and wanted to extract the GPS coordinates from each photo so I could load them into QGIS using the add delimited text dialogue. The aim wasn’t to have precise locations for each photo (GPS positions for each of the points surveyed were recorded separately) but to give a quick idea of the locations we’d visited before the main GPS data were processed.

I while ago (2012!) I wrote a script for this task. The script (CreateJPEGKMZ) requires pillow and the imagemagick command line tools.

It code isn’t particularly tidy (despite my recent updates) but it basically pulls out the GPS location from the EXIF tags and writes this to a CSV file. To create a KMZ a thumbnail image is created and a corresponding KML file. Both the thumbnail and KML are then zipped together to generate the KMZ which can be opened in GoogleEarth.

To use the script to write a CSV file:

CreateJPEGKMZ.py --outcsv gps_points.csv input_jpeg_files

To also create KMZ files for each photo:

CreateJPEGKMZ.py --outcsv gps_points.csv \
                 --outkmz output_kmz_files 

This was also a good lesson in the benefits of having scripts in version control and publicly available.

Working with hyperspectral data in ENVI BIL format using Python

For working with ENVI files I normally use GDAL as code can then be applied to different formats. However, there are a couple of limitations with GDAL when working with hyperspectral data in ENVI format:

  1. GDAL doesn’t copy every item from the header file to a new header file if they don’t fit in with the GDAL data model. Examples are FWHM and comments. Sometimes extra attributes are copied to the aux.xml file GDAL creates, these files aren’t read by ENVI or other programs based on IDL (e.g., ATCOR).
  2. For data stored Band Interleaved by Line (BIL) rather than Band Sequential (BSQ) reading and writing a band at a time is inefficient as it is necessary to keep jumping around the file

To overcome these issues NERC-ARF-DAN use their own Python functions for reading / writing header files and loading BIL files a line at a time. These functions have been tidied up and released through the NERC-ARF Tools repository on GitHub (https://github.com/pmlrsg/arsf_tools). The functions depend on NumPy.

To install them it is recommended to use the following steps:

  1. Download miniconda from http://conda.pydata.org/miniconda.html#miniconda and follow the instructions to install.
  2. Open a ‘Command Prompt’ / ‘Terminal’ window and install numpy by typing:
    conda install numpy
  3. Download ‘arsf_tools’ from GitHub (https://github.com/pmlrsg/arsf_tools/archive/master.zip)
  4. Unzip and within a ‘Command Prompt’ or ‘Terminal’ window navigate to the the folder using (for example):
    cd Downloads\arsf_tools-master
  5. Install the tools and library by typing:
    python setup.py install

Note, if you are using Linux you can install the arsf_binary reader from https://github.com/arsf/arsf_binaryreader which is written in C++. The ‘arsf_envi_reader’ module will import this if available as it is faster than the standard NumPy BIL reader.

If you are a UK based researcher with access to the JASMIN system the library is already installed and can be loaded using:

module load contrib/arsf/arsf_tools

An simple example of reading each line of a file, adding 1 to every band and writing back out again is:

from arsf_envi_reader import numpy_bin_reader
from arsf_envi_reader import envi_header

# Open file for output
out_file = open("out_file.bil", "w")

# Open input file
in_data = numpy_bin_reader.BilReader("in_file.bil")

for line in in_data:
out_line = line + 1

# Copy header

# Close files
in_data = None

A more advanced example is applying a Savitzky-Golay filter to each pixel. As the filter requires every band for each pixel it is efficient to work with BIL files.

For developing algorithms using spatial data, in particular multiple files it is recommended to convert the files to another band-sequential format using the ‘gdal_translate’ so they can be read using programs which use GDAL, for example RIOS or RSGISLib.

To convert files from ENVI BIL to ENVI BSQ and copy all header attributes the ‘envi_header’ module can be used after converting the interleave with GDAL. For example:

import os
import subprocess
from arsf_envi_reader import envi_header

# Set input and output image (could get with argparse)
input_image = "input.bil"
output_image = "output_bsq.bsq"

# Get interleave from file extension
output_interleave = os.path.splitext(output_image)[-1].lstrip(".")

# 1. Convert interleave
print("Converting interleave to {}".format(output_interleave.upper()))
gdal_translate_cmd = ["gdal_translate",
                      "-of", "ENVI",
                      "-co", "INTERLEAVE={}".format(output_interleave.upper())]
gdal_translate_cmd.extend([input_image, output_image])

# 2. Copy header (GDAL doesn't copy all items)
# Find header files
input_header = envi_header.find_hdr_file(input_image)
output_header = envi_header.find_hdr_file(output_image)

# Read input header to dictionary
input_header_dict = envi_header.read_hdr_file(input_header)

# Change interleave
input_header_dict["interleave"] = output_interleave.upper()

# Write header (replace one generated by GDAL)
envi_header.write_envi_header(output_header, input_header_dict)

Add all scripts within a repository to $PATH using envmaster

I have a couple of general scripts repositories for myself and shared with colleagues. These are for scripts which are useful but don’t fit into existing projects or justify having their own repository. An example is rsgis_scripts on bitbucket. The scripts are split into different directories which aren’t available on the main path. To make them available when needed I use EnvMaster (described in a previous post). I have an envmaster module which will search the repository for folders containing executables and add them to $PATH. It will also add folders containing ‘__init__.py’ to $PYTHONPATH.

import os
import glob
REPO_PATH = "/home/dan/Documents/Development/rsgis_scripts"
# Walk through directory
for d_name, sd_name, f_list in os.walk(REPO_PATH):
    # Ignore hidden directories
    if not d_name.startswith(".") and not ".git" in d_name and not ".hg" in d_name:
        file_list = glob.glob(os.path.join(d_name,"*"))
        # Check if a directory contains executable files
        if True in [(os.path.isfile(f) and os.access(f, os.X_OK)) for f in file_list]:
        # Check for Python libraries
        if len(glob.glob(os.path.join(d_name,"*","__init__.py"))) > 0:

The script is saved as ‘rsgis_scripts’ in ‘$ENVMASTERPATH’.
To load the module and prepend all the folders to ‘$PATH’ use:

envmaster load rsgis_scripts

They will remain on $PATH until you close the terminal or unload the module using:

envmaster unload rsgis_scripts

This script is just an example, it would be easy to modify for different use cases. If you have multiple repositories a module file can be created for each one.

Convert EASE-2 grid cell to latitude and longitude using Python

The EASE-2 grid is used for a number of NASA datasets including SMAP. It is described in the following paper:

Brodzik, M. J., Billingsley, B., Haran, T., Raup, B., & Savoie, M. H. (2012). EASE-Grid 2.0: Incremental but Significant Improvements for Earth-Gridded Data Sets. ISPRS International Journal of Geo-Information, 1(3), 32–45. http://doi.org/10.3390/ijgi1010032

Files with the centre coordinate of each cell for EASE-2 grids at different resolutions are available from https://nsidc.org/data/ease/tools.html as well as tools for conversion.

To read these files into Python the following steps can be used:

  1. Download the relevant gridloc file.

    The FTP link for the grid location files is: ftp://sidads.colorado.edu/pub/tools/easegrid2/

    For this example I’ve chosen the 36km cylindrical EASE-2 grid (gridloc.EASE2_M36km.tgz)

  2. Un-tar using:
    tar -xvf gridloc.EASE2_M36km.tgz
  3. Read the files into Python:
    import numpy
    # Read binary files and reshape to correct size
    # The number of rows and columns are in the file name
    lats = numpy.fromfile('EASE2_M36km.lats.964x406x1.double', 
    lons = numpy.fromfile('EASE2_M36km.lons.964x406x1.double', 
    # Extract latitude and longitude
    # for a given row and column 
    grid_row = 46
    grid_column = 470
    lat_val = lats[grid_row, grid_column]
    lon_val = lons[grid_row, grid_column]

Calculations on large Raster Attribute Tables using ratapplier

The RIOS Library (http://rioshome.org) provides two methods of manipulating columns within a Raster Attribute Table (RAT):

  1. The ‘rat’ module can read an entire column to memory.
    from osgeo import gdal
    from rios import rat
    # Open RAT dataset
    rat_dataset = gdal.Open("clumps.kea", gdal.GA_Update)
    # Get columns with average red and NIR for each object
    red = rat.readColumn(rat_dataset, "RedAvg")
    nir = rat.readColumn(rat_dataset, "NIR1Avg")
    ndvi = (nir - red) / (nir + red)
    # Write out column
    rat.writeColumn(rat_dataset, "NDVI", ndvi)
    # Close RAT dataset
    rat_dataset = None
  2. The newer ‘ratapplier’ module is modelled after the ‘applier’ module for images and allows a function to be applied to chunks of rows, making it particularly useful for a RAT which is too large to load to memory.
    from rios import ratapplier
    def _ratapplier_calc_ndvi(info, inputs, outputs):
        Calculate NDVI from RAT.
        Called by ratapplier
        # Get columns with average red and NIR for each object
        # within block
        red = getattr(inputs.inrat, "RedAvg")
        nir = getattr(inputs.inrat, "NIR1Avg")
        # Calculate NDVI
        ndvi = (nir - red) / (nir + red)
        # Save to 'NDVI' column (will create if doesn't exist)
        setattr(outputs.outrat,"NDVI", ndvi)
    if __name__ == "__main__":
        # Set up rat applier for input / output
        in_rats = ratapplier.RatAssociations()
        out_rats = ratapplier.RatAssociations()
        # Pass in clumps file
        # Same file is used for input and output to write
        # to existing RAT
        in_rats.inrat = ratapplier.RatHandle("clumps.kea")
        out_rats.outrat = ratapplier.RatHandle("clumps.kea")
        # Apply function to all rows in chunks
        ratapplier.apply(_ratapplier_calc_ndvi, in_rats, out_rats)

Although using ratapplier looks slightly more complicated at first writing scripts to use it rather than the ‘rat’ interface means they will scale much better to larger datasets.

Both these examples are available to download from https://bitbucket.org/snippets/danclewley/7E666

Clip and Reproject an image to a master file using RSGISLib and GDAL

Nathan Thomas

Through my research I handle a plethora of raster datasets, each with a variety of resolutions and projections that I have to clip, reproject and resample to a set of consistent parameters, derived from a ‘master’ dataset. This would usually require multiple processing steps and a series of inputs, such as the extents of the image and the input projection (possibly a WKT file). The greater the number of inputs that are required the greater the chance that exists of an error occurring.

Here is an alternative method using a simple combination of RSGISLib and GDAL, which requires very little user input. This method relieves the user from having to retrieve the parameters of the input dataset from the image file, if these parameters are unknown. This is particularly useful if a user wishes to clip a dataset at a series of different study sites, each with a different spatial extent and possibly even different projection. This method is aimed at being simple to implement and would suit a beginner in Python looking to get multiple input datasets to a set of standard parameters, without requiring the knowledge to open an image and retrieve the required extent, resolution and projection.

Create Copy Image

The first step implements a command in RSGISLib to create a blank copy of a master dataset, populated with 0 values. This creates a blank image of a single value that has the parameters of the input file such as projection information, extent and resolution. The number of bands can also be specified so that a 7-band image can be created with the parameters from a single band image input. The RSGISLib command is as follows:

import rsgislib

inMask = 'inMaskImage.kea'
outName = 'outImage.kea'
numberOfBands = 1
GDALformat = 'KEA'
GDALtype = rsgislib.TYPE_32FLOAT

rsgislib.imageutils.createCopyImage(inMask, outName, \
 numberOfBands, 0, GDALformat, GDALtype)

GDAL Reproject Image

The second step implements a GDAL command to open a second input dataset (the one to clip) as read only and effectively populate the blank image created in step one with the values from the second input dataset. The output of this is a clipped image with the parameters of the master input file. The process is as follows:

from osgeo import gdal

ClipFile = 'largeRaster.kea'
outName = 'outImage.kea'
interpolationMethod = gdal.GRA_CubicSpline

inFile = gdal.Open(ClipFile, gdal.GA_ReadOnly)
outFile = gdal.Open(outName, gdal.GA_Update)

gdal.ReprojectImage(inFile, outFile, None, None, \


The script is simple to implement and requires a number of inputs:

  • ‘-m’ Input Mask – The input file with the desired parameters
  • ‘-I’ Input Image – The image you wish to clip
  • ‘-o’ Output Image – The output image
  • ‘-n’ Number of Bands – The number of bands that the clipped scene requires (ie, to clip an RGB dataset the number of bands would be 3)
  • ‘-of’ GDAL format – The format of the output image (i.e. KEA)

An example of the implementation may look like:

python CutReproject.py –m N00E103_HH.kea \
        –i GlobalSRTMmosaic.kea \
        –o N00E103_SRTM.kea –n 1 –of KEA

CutReproject.py is available to download from bitbucket.org/nathanmthomas/bucket-of-rs-and-gis-scripts/