Category Archives: Python

A script to find and download GEDI passes

The Global Ecosystem Dynamics Investigation (GEDI) is a spaceborne lidar instrument mounted on the International Space Station (ISS). The GEDI instrument is a geodetic-class lidar with 3 lasers that produce 8 parallel tracks. Each laser illuminates a 25 m footprint on the ground and fires 242 times per second. Footprints are separated by 60 m in the along-track direction, with an across-track distance of 600 m between each of the 8 tracks. GEDI’s precise measurements of forest canopy height, canopy vertical structure, and surface elevation will be critical to characterizing and understanding carbon and water cycling processes to further our knowledge of the world we live in.

GEDI is mounted on the ISS and its orbit is therefore dictated by that of the ISS. This prevents repeat acquisitions in the way that Landsat or Sentinel satellite data are collected: while this enables wider coverage, data acquisition is not consistent. Furthermore, GEDI has the ability to point its lasers, so the area imaged on the ground is not necessarily at nadir beneath the sensor.

To date, GEDI data is available via two means:

Earthdata provides the data but does not currently provide sufficient visualization for users to see whether the data intersects their region of interest (ROI).

Alternatively, the NASA LP DAAC provides the data in a list/FTP format. This allows easy access to all the data but requires that the user knows which orbit they require. Since February 2020, this has been supplemented by a web tool called GEDI Finder. Users pass in bounding box coordinates alongside the GEDI product and version they require, and it returns the subset of the data list that intersects their bounding box. While this is a simple method for accessing data, it requires that a user either manually downloads each file or passes each name to software (wget, curl) to pull the data.

The script was written to make the best of these existing tools and to automate the whole search and download process. It relies heavily on the GEDI Finder tool to search for the data that intersects a bounding box and is, in its simplest form, a wrapper for this tool. It also allows an optional date range to be specified if users are only interested in data collected during a specific time period. The script requires a number of command line options, including GEDI product, version, bounding box, output path and Earthdata login credentials. An example is:

python -p GEDI02_B -v 001 \
-bb 0.3714633 9.277247 -0.08164874 10.00922 \
-d 2019-01-01 2019-04-30 \
-o /Users/Me/Data/Gedi \
-u MyEarthDataUsername -pw MyEarthDataPassword


-p is the GEDI product (e.g., GEDI02_B)

-v is the GEDI version (e.g., 001)

-bb is the bounding box, given as UpperLeftLon (minX), UpperLeftLat (maxY), LowerRightLon (maxX) and LowerRightLat (minY). These should be passed to the terminal separated by spaces.

-d is the date range, specified in the format YYYY-MM-DD, with the start and end dates separated by a space

-o is the local path where you want to download the data

-u is your Earthdata login username (note: an Earthdata account is required)

-pw is your Earthdata login password


python -h

will also print this information. The script contains three functions.

  • The first constructs the query from the command line options, passes it to the existing GEDI Finder tool and pulls the list of GEDI files from the returned web page.
  • The second function parses the search results pulled from the GEDI Finder URL and constructs a list where each element is a separate GEDI H5 file.
  • The final function iterates over the list of H5 files and pulls each one using wget.

The script requires that wget is installed. On Linux this can be installed through the package manager if not already present; on macOS you can install it from conda-forge using:

conda create -n gedi -c conda-forge python wget

The command will print the number of files found and then start downloading them. Each file is approximately 1 GB.
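The three functions described above could be sketched as follows. This is an illustrative outline rather than the script’s actual code: the GEDI Finder endpoint URL and the function names are assumptions.

```python
import re
import subprocess


def build_query(product, version, bbox):
    """Build a GEDI Finder request URL (endpoint shown is an assumption)."""
    bbox_str = ",".join(str(c) for c in bbox)
    return ("https://lpdaacsvc.cr.usgs.gov/services/gedifinder"
            "?product={0}&version={1}&bbox=[{2}]".format(product, version, bbox_str))


def parse_h5_list(response_text):
    """Pull the GEDI .h5 file URLs out of the GEDI Finder response text."""
    return re.findall(r'https?://[^"\',\s]+\.h5', response_text)


def download_all(url_list, out_dir, user, password):
    """Fetch each file with wget (requires wget on the PATH)."""
    for url in url_list:
        subprocess.check_call(["wget", "--user", user, "--password", password,
                               "-P", out_dir, url])
```
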

Author: Nathan Thomas (@DrNASApants)

Nathan is a UMD Earth System Science Interdisciplinary Center (ESSIC) postdoc based at the NASA Goddard Space Flight Center. His research focuses primarily on land cover mapping and characterizing the above-ground structure of vegetation, particularly mangrove forests. Through this he uses Python to pull, preprocess, calibrate, analyze and display remote sensing data, some of which is distributed through his Bitbucket.

Working with hyperspectral data in ENVI BIL format using Python

For working with ENVI files I normally use GDAL as code can then be applied to different formats. However, there are a couple of limitations with GDAL when working with hyperspectral data in ENVI format:

  1. GDAL doesn’t copy every item from the header file to a new header file if they don’t fit the GDAL data model. Examples are FWHM and comments. Sometimes extra attributes are copied to the aux.xml file GDAL creates, but these files aren’t read by ENVI or other programs based on IDL (e.g., ATCOR).
  2. For data stored Band Interleaved by Line (BIL) rather than Band Sequential (BSQ), reading and writing a band at a time is inefficient, as it is necessary to keep jumping around the file.

To overcome these issues NERC-ARF-DAN use their own Python functions for reading / writing header files and loading BIL files a line at a time. These functions have been tidied up and released through the NERC-ARF Tools repository on GitHub. The functions depend on NumPy.

To install them it is recommended to use the following steps:

  1. Download Miniconda and follow the instructions to install it.
  2. Open a ‘Command Prompt’ / ‘Terminal’ window and install NumPy by typing:
    conda install numpy
  3. Download ‘arsf_tools’ from GitHub.
  4. Unzip it and, within a ‘Command Prompt’ or ‘Terminal’ window, navigate to the folder using (for example):
    cd Downloads\arsf_tools-master
  5. Install the tools and library by typing:
    python setup.py install

Note, if you are using Linux you can install the ARSF binary reader, which is written in C++. The ‘arsf_envi_reader’ module will import this if it is available, as it is faster than the standard NumPy BIL reader.

If you are a UK-based researcher with access to the JASMIN system, the library is already installed and can be loaded using:

module load contrib/arsf/arsf_tools

A simple example of reading each line of a file, adding 1 to every band and writing it back out again is:

from arsf_envi_reader import numpy_bin_reader
from arsf_envi_reader import envi_header

# Open file for output
out_file = open("out_file.bil", "wb")

# Open input file
in_data = numpy_bin_reader.BilReader("in_file.bil")

# Read a line at a time (all bands for each line), add 1 and write out
for line in in_data:
    out_line = line + 1
    out_line.tofile(out_file)

# Copy header
input_header = envi_header.find_hdr_file("in_file.bil")
header_dict = envi_header.read_hdr_file(input_header)
envi_header.write_envi_header("out_file.bil.hdr", header_dict)

# Close files
out_file.close()
in_data = None

A more advanced example is applying a Savitzky-Golay filter to each pixel. As the filter requires every band for each pixel, it is efficient to work with BIL files.
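As a sketch of this step, SciPy’s savgol_filter can be applied along the band axis of a line read from a BIL file. The array shape here is illustrative: a line is assumed to be stored as one row per band and one column per sample.

```python
import numpy as np
from scipy.signal import savgol_filter

# A synthetic BIL line: 100 bands by 320 samples (shape is illustrative)
line = np.random.rand(100, 320)

# Smooth each pixel's spectrum: the filter runs along the band axis
# (axis=0), so every column (pixel) is filtered independently
smoothed = savgol_filter(line, window_length=7, polyorder=3, axis=0)
```

The smoothed line has the same shape as the input and can be written straight back out, as in the simple example above.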

For developing algorithms using spatial data, in particular with multiple files, it is recommended to convert the files to another band-sequential format using ‘gdal_translate’ so they can be read by programs which use GDAL, for example RIOS or RSGISLib.

To convert files from ENVI BIL to ENVI BSQ and copy all header attributes the ‘envi_header’ module can be used after converting the interleave with GDAL. For example:

import os
import subprocess
from arsf_envi_reader import envi_header

# Set input and output image (could get with argparse)
input_image = "input.bil"
output_image = "output_bsq.bsq"

# Get interleave from file extension
output_interleave = os.path.splitext(output_image)[-1].lstrip(".")

# 1. Convert interleave
print("Converting interleave to {}".format(output_interleave.upper()))
gdal_translate_cmd = ["gdal_translate",
                      "-of", "ENVI",
                      "-co", "INTERLEAVE={}".format(output_interleave.upper())]
gdal_translate_cmd.extend([input_image, output_image])
subprocess.check_call(gdal_translate_cmd)

# 2. Copy header (GDAL doesn't copy all items)
# Find header files
input_header = envi_header.find_hdr_file(input_image)
output_header = envi_header.find_hdr_file(output_image)

# Read input header to dictionary
input_header_dict = envi_header.read_hdr_file(input_header)

# Change interleave
input_header_dict["interleave"] = output_interleave.upper()

# Write header (replace one generated by GDAL)
envi_header.write_envi_header(output_header, input_header_dict)

A script to batch download files using PycURL

When downloading a lot of large files (e.g., remote sensing data), using a browser is awkward: you don’t want to set all the files downloading at once, and sitting around waiting for one download to finish before starting another is tedious. You might also want to download files to somewhere other than your ‘Downloads’ folder.

To make downloading files easier you can use the script available on Bitbucket.

The script uses PycURL, which is available to install through conda using:

conda install pycurl

It takes a text file with a list of files to download as input. If you were downloading files using a browser, you can right click on the link and select ‘Copy Link Location’ (the exact text will vary depending on the browser you are using) instead of downloading the file. Some files follow a logical pattern, so you could use this to get a couple of links and then just copy and paste, changing the parts that vary (e.g., the number of the file, location, date, etc.).

Once you have the list of files to download the script is run using:

python \
     --filelist ~/Desktop/JAXAFileNamesCut.txt \
     --failslist ~/Desktop/JAXAFileNamesFails.txt \
     --outputpath ~/Desktop/JAXA_2010_PALSAR/ 

Any downloads which fail will be added to the file specified by the ‘--failslist’ argument.

By default the script will check if a file has already been downloaded and won’t download it again; you can skip this check using ‘--nofilecheck’. It is also possible to set a time to pause between downloads with ‘--pause’, to avoid rejection from a server when doing big downloads. For all the available options run:
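The check-before-download behaviour can be sketched as follows. The function and argument names are illustrative, and a generic fetch callable stands in for the script’s PycURL download code:

```python
import os


def batch_download(file_list_path, output_path, fails_list_path, fetch):
    """Download each URL listed in file_list_path into output_path,
    skipping files that already exist; URLs that fail to download are
    appended to fails_list_path. 'fetch' is any callable taking
    (url, out_file) and raising an exception on failure."""
    with open(file_list_path) as f:
        urls = [line.strip() for line in f if line.strip()]

    with open(fails_list_path, "w") as fails:
        for url in urls:
            out_file = os.path.join(output_path, os.path.basename(url))
            if os.path.exists(out_file):
                continue  # already downloaded - skip
            try:
                fetch(url, out_file)
            except Exception:
                fails.write(url + "\n")
```
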

python  --help

As it is running from the command line, you can set it running on one machine (e.g., a desktop at the office) and check on the progress remotely (e.g., from your laptop at home) using ssh. So that the script keeps running when you close the session, you can run it within GNU Screen, which is installed by default on OS X. To start a session type:

screen
then type ctrl+a ctrl+d to detach the session. You can reattach using:

screen -R

to check the progress of your downloads.
Alternatively you can use tmux; however, this isn’t available by default.

Plot 2D Histogram of Pixel Values from Two Images

Recently I wanted to plot the pixel values of two images against each other. I thought it would be good to combine extracting the pixel values (using RIOS) and plotting the data (using matplotlib) in a single script.

The script I wrote is available to download from the RSGIS Scripts repository.

There are three main steps:

  1. RIOS is used to iterate through the image in small blocks, for each block only pixels that have data in each band are retained.
  2. numpy.histogram2d is used to produce a 2D histogram.
  3. The histogram is plotted using matplotlib.

The 2D histogram was based on existing example code. As there are a lot of points, displaying them as a 2D histogram makes the information easier to see.
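The masking and histogram steps can be sketched with NumPy alone; synthetic arrays stand in for the two image bands, and 0 is assumed as the no-data value:

```python
import numpy as np

# Two bands of pixel values (synthetic stand-ins for the two images)
np.random.seed(0)
band_a = np.random.rand(100, 100)
band_b = band_a + 0.1 * np.random.rand(100, 100)

# Keep only pixels that have data in both images (0 used as no-data here)
mask = (band_a > 0) & (band_b > 0)
a_vals = band_a[mask]
b_vals = band_b[mask]

# 2D histogram of the paired pixel values
hist, xedges, yedges = np.histogram2d(a_vals, b_vals, bins=50)
```

The resulting hist array can then be displayed with matplotlib, for example using plt.imshow or plt.pcolormesh.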

The simplest usage of the script is:

python \
--imageA imageA.kea \
--imageB imageB.kea \
--plot outplot.pdf 

This will plot the first band of each image. To change the bands used, the labels for the plot and the scaling, additional flags can be passed in:

python \
--imageA imageA.kea \
--imageB imageB.kea \
--bandA 1 --bandB 1 \
--plot outplot.pdf \
--plotMin 0 --plotMax 0.5 \
--labelA "13157 - 004 (HH)" \
--labelB "13157 - 006 (HH)"

The output looks something like this:

The script assumes both images have the same projection and resolution; RIOS takes care of the spatial information. It would be possible to adapt the script so RIOS resamples the data, see the RIOS wiki for more details.

Convert SSURGO soil data to a Raster Attribute Table

SSURGO (Soil Survey Geographic database) provides soil information across the United States. The data is provided as shapefiles with the mapping units. The attributes for each polygon are stored as text files, which need to be imported into an Access database and linked with the shapefile.

An alternative for working with SSURGO data is to convert the shapefile to a raster, parse the text files and store the attributes for each mapping unit as a Raster Attribute Table (RAT).

To do this the following steps are required:

  1. Use gdal_rasterize to create a raster.
  2. Use RSGISLib to convert to a RAT.
  3. Add a column for each attribute using RIOS.

A Python script to perform these steps is available to download.

An example of usage is:

python --indir CA669 \
  --colname claytotal_ \
  --outformat KEA

A list of all available columns can be viewed using:

python --printcols

To export the columns as a standard raster (using RSGISLib) pass in the ‘--raster’ flag.

SSURGO data can be downloaded from the USDA NRCS Geospatial Data Portal.
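As an illustration of the parsing step, SSURGO tabular text files are pipe-delimited, so a single attribute per map unit can be read into a dictionary keyed on the map unit ID. The file name and column indices used below are illustrative; they vary by table:

```python
import csv


def read_ssurgo_column(txt_file, key_col, value_col):
    """Read one attribute per map unit from a pipe-delimited SSURGO
    tabular text file into a dictionary. The column indices depend on
    which table is being read."""
    attributes = {}
    with open(txt_file) as f:
        for row in csv.reader(f, delimiter='|'):
            attributes[row[key_col]] = row[value_col]
    return attributes
```

Each value in the returned dictionary can then be written to a RAT column for the matching map unit.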

Import CSV into Python using Pandas

One of the features I like about R is that when you read a CSV file into a data frame you can access columns using names from the header row. The Python Data Analysis Library (pandas) aims to provide a similar data frame structure in Python and also has a function to read a CSV. Once pandas has been installed a CSV file can be read using:

import pandas
data_df = pandas.read_csv('in_data.csv')

To get the names of the columns use:

data_df.columns
And to access columns use:

colHH = data_df['colHH']

Or if the column name is a valid Python variable name:

colHH = data_df.colHH

This is only a tiny part of pandas; there are lots of features available (which I’m just getting into). One interesting one is the ability to create pivot table reports from a data frame, similar to Excel.
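For example, a minimal pivot table report might look like this (the data frame and its column names are made up for illustration):

```python
import pandas

# A small example data frame (column names are illustrative)
data_df = pandas.DataFrame({
    'site':  ['A', 'A', 'B', 'B'],
    'year':  [2013, 2014, 2013, 2014],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# Pivot table report: mean value per site and year
report = pandas.pivot_table(data_df, values='value',
                            index='site', columns='year')
```
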

Recursively finding files in Python using the find command

The find command on a UNIX system is great for searching through directories for files. There are lots of options available but the simplest usage is:

find PATH -name 'SEARCH'

Combined with the subprocess module in Python, it’s easy to use this search capability within a Python script:

import subprocess

# Set up find command
findCMD = 'find . -name "*.kea"'
out = subprocess.Popen(findCMD, shell=True, stdin=subprocess.PIPE,
                       stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# Get standard out and error
(stdout, stderr) = out.communicate()

# Save found files to list
filelist = stdout.decode().split()

The above code can be used to get the output from any command line tool in Python.

As noted in the comment by Sam, a more portable way of finding files is to use the os.walk function combined with fnmatch:

import os, fnmatch

inDIR = '/home/dan/'
pattern = '*kea'
fileList = []

# Walk through directory
for dName, sdName, fList in os.walk(inDIR):
    for fileName in fList:
        if fnmatch.fnmatch(fileName, pattern): # Match search string
            fileList.append(os.path.join(dName, fileName))

Reading MATLAB files with Python

If you need to read MATLAB (.mat) data files, there is a function within scipy.io which allows you to do this. Usage is simple and well explained in the SciPy documentation:

  1. Import the file:
    from scipy import io
    inMATFile = 'ssurgo_data.mat'
    soildata = io.loadmat(inMATFile)
  2. Get a list of keys:
    soildata.keys()
  3. Extract data to a NumPy array:
    soildata_varname = soildata['varname']

Reading tabular data in a non-standard format

This post was going to be called ‘faffing around with data’, which might summarise many people’s feelings towards dealing with data that isn’t in the format they need it to be in. An earlier post described adding an ENVI header to binary files so they could be read into GDAL. This post details using Python to read tabular data stored in a text file, although many of the functions would apply to extracting any data from a text file.

Built in functions

Before setting out it’s a good idea to confirm the data can’t be read with any of the functions available in Python. If it opens in Excel, it’s likely these will work, and they may be an easier option (no point reinventing the wheel!).

The CSV module in the standard Python library is really easy to use and can handle various delimiters:

import csv

# Open file
f = open('soil_moisture.csv', 'r')
csvfile = csv.reader(f, delimiter=',')

# Skip first line (header)
headerline = next(csvfile)

# Iterate through file
for line in csvfile:
    print('First col = ', line[0])


In NumPy, the genfromtxt function can be used to read a file directly to a NumPy array:

import numpy as np
smdata = np.genfromtxt('soil_moisture.csv', delimiter=',', skip_header=1)

Depending on how many files you have to process, a bit of clever use of find and replace in a text editor or using the sed command might allow you to get your data into a format that can be read by the built in functions. If this does end up being too messy it will at least provide some insight into the steps needed to write a script to read the data.
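For example, a single sed command could turn colon-delimited data into a standard CSV that the built-in readers handle (the file names and delimiters here are illustrative):

```shell
# Create a small colon-delimited sample file
printf 'site:year:moisture\ntonzi:2013:15.69\n' > soil_moisture.txt

# Replace every colon with a comma so the file reads as standard CSV
sed 's/:/,/g' soil_moisture.txt > soil_moisture.csv
```
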

Writing a Custom Reader

The basic structure of a custom reader is to iterate through each line in the file and split it into columns.

# Open file
infile = open('soil_moisture.csv', 'r')

# Read header to separate variable (repeat for multiple lines)
header1 = next(infile)

# Iterate through file
for line in infile:
    # Count delimiters (:)
    ncolon = line.count(':')
    # Check we have 3 (4 columns)
    if ncolon == 3:
        # Separate
        elements = line.split(':')
        print('First col = ', elements[0])

If the data is fixed width and separated by multiple spaces, I use the re module in Python to replace consecutive spaces with another character.

import re
line = 'tonzi  2013   12  18 1  1 9  1   20   15.69'
linesub = re.sub(r'\s+', ':', line)
elements = linesub.split(':')

Where ‘\s+’ is a regular expression for one or more spaces. When choosing the character to use as a replacement, make sure it’s not used anywhere else in the file.

Time Data

Time can be stored in a range of different formats. Python offers ways of storing and manipulating time and date data as part of the time module. One of the easiest ways of getting time from a text file into the Python representation is to use ‘struct_time’. If each component of the date (year, month, etc.) is stored as a separate column you can use the following:

import time

# 'line' is the list of column values split from the current row
year = int('20' + line[1])
month = int(line[2])
day = int(line[3])
hour = int(line[4])
minute = int(line[5])
second = int(line[6])

mTimeTS = time.struct_time((year,month,day,hour,minute,second,0,0,0))

If the date is stored in a single column the ‘strptime’ function can be used to convert to a Python time structure. For example:

import time
timeStr = '2014-07-10 08:14:01'
mTimeTS = time.strptime(timeStr,'%Y-%m-%d %H:%M:%S')

Writing data out

The CSV library, mentioned earlier, can also be used to write formatted data out to a CSV file.

outfile = open('outdata.csv', 'w')

outcsv = csv.writer(outfile)

# Write header row (column names are illustrative)
outcsv.writerow(['time', 'soil_moisture'])

# Iterate through input file
for line in infile:
    # Read and split data into 'elements'
    # as described earlier
    elements = line.split(':')
    # Write first two columns out to CSV
    outcsv.writerow(elements[0:2])

outfile.close()

Join two CSV files based on a common field

Sometimes it’s necessary to join two CSV files together using a common field. I wrote a Python script a while ago to perform this task using the csv Python library. The script is run from the command line and takes the input and output CSV files and the numbers of the columns to match (starting at 0).

For example to join attributes from inMatchFileName.csv to inRefFileName.csv, by matching the first column you can use:

python inRefFileName.csv \
    inMatchFileName.csv outFileName.csv 0 0

Note this assumes there is only one record for each ID; if there are multiple (as in a relational database) this simple approach won’t work, and actually loading the data into a relational database would be better. My recommendation for this is SQLite, a self-contained database (i.e., stored in a single file and not requiring a server) that can be accessed from Python and R.
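The core matching logic of such a script can be sketched with the csv module. The function below is an illustrative outline, not the script itself, and assumes one record per ID as noted above:

```python
import csv


def join_csv(ref_file, match_file, out_file, ref_col, match_col):
    """Join rows of match_file onto ref_file where the value in
    ref_col equals the value in match_col (one record per ID assumed)."""
    # Index the match file by its key column
    with open(match_file) as f:
        match_rows = {row[match_col]: row for row in csv.reader(f)}

    # Write out each reference row with its matching row appended
    with open(ref_file) as f_in, open(out_file, 'w', newline='') as f_out:
        writer = csv.writer(f_out)
        for row in csv.reader(f_in):
            key = row[ref_col]
            if key in match_rows:
                writer.writerow(row + match_rows[key])
```

Reference rows with no match are simply dropped here; the alternative of writing them out with empty fields is a one-line change.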