Converting a Jupyter Notebook to a Python Script

Jupyter notebooks are great for developing a method, but what if you need to run that method on lots of files? This post describes how to go from a notebook to a script which you can run on the command line and apply to multiple files.

For this example we’ll be using a notebook written to calculate the Floating Debris Index (FDI; [1]) and apply it to a Sentinel-2 netCDF file processed using ACOLITE. If you are interested in running the notebook, a tutorial on using ACOLITE is available from a NEODAAS training course. You can download ACOLITE and the manual from here. The xarray library is used to read the netCDF file, calculate FDI and write the result back out to a netCDF file.

[1] Biermann, L., Clewley, D., Martinez-Vicente, V. et al. Finding Plastic Patches in Coastal Waters using Optical Satellite Data. Sci Rep 10, 5364 (2020). https://doi.org/10.1038/s41598-020-62298-z
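For reference, the core of the calculation looks roughly like the sketch below. This is an illustration only, not the notebook code: the ACOLITE L2R variable names (rhos_<wavelength>) and the wavelengths are assumptions for Sentinel-2A, and the band combination follows my reading of the FDI definition in [1], so check the paper and the notebook for the exact details.

import xarray

# Rough sketch only: variable names and wavelengths below are assumptions for
# Sentinel-2A ACOLITE L2R output; see [1] and the notebook for the exact code.
s2_data = xarray.open_dataset("S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R.nc",
                              decode_cf=False, chunks=512)

red = s2_data["rhos_665"]     # B4 (red)
re2 = s2_data["rhos_739"]     # B6 (red edge)
nir = s2_data["rhos_833"]     # B8 (NIR)
swir1 = s2_data["rhos_1614"]  # B11 (SWIR1)

# FDI = R_NIR - R'_NIR, where R'_NIR is a baseline between the red edge and
# SWIR1 bands evaluated at the NIR wavelength (scaled by 10 in the paper)
nir_prime = re2 + (swir1 - re2) * ((833.0 - 665.0) / (1614.0 - 665.0)) * 10.0
fdi = nir - nir_prime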

1. Tidy up notebook

The first step is to tidy up the notebook so it only contains the code that is required. This means removing cells which print out lots of data or produce plots (unless plots are an output of the notebook).

2. Export as Python

From File -> Export Notebook As… in the Jupyter Lab menu select ‘Export Notebook to Executable Script’. This will give you a Python script to download.

If you copy the script to the same directory as the notebook and open a terminal / command prompt in that directory (or select ‘File’ -> ‘New’ -> ‘Terminal’ in Jupyter Lab to open a terminal within your browser), you can run the script using python SCRIPT_NAME, so for this example:

python s2_FDI_Calculation.py

This allows running the script outside a notebook. However, it will only process the netCDF file which has been hard coded in the script; we want to be able to specify the input file on the command line.

3. Allow specifying input file using argparse

The input and output file are specified in the lines:

s2_data = xarray.open_dataset("S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R.nc", decode_cf=False, chunks=512)

and:

fdi_xarray.to_netcdf("S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R_fdi.nc", engine="netcdf4", encoding={"fdi" : {"zlib" : True, "complevel" : 4}})

Using argparse it is possible to specify file names (and many other parameters) from the command line.

For example:

import argparse
parser = argparse.ArgumentParser(description="Calculate Floating Debris Index (FDI) Biermann "
                                                " to a S2 scene")
parser.add_argument("innetcdf", help="Input netCDF file")
parser.add_argument("outnetcdf", help="Output netCDF file")
args = parser.parse_args()

Add this at the start of the script and change the lines to:

s2_data = xarray.open_dataset(args.innetcdf, decode_cf=False, chunks=512)

and:

fdi_xarray.to_netcdf(args.outnetcdf, engine="netcdf4", encoding={"fdi" : {"zlib" : True, "complevel" : 4}})

Trying to run the script as before using python s2_FDI_Calculation.py will now print the following message:

usage: s2_FDI_Calculation.py [-h] innetcdf outnetcdf
s2_FDI_Calculation.py: error: the following arguments are required: innetcdf, outnetcdf

So argparse not only allows taking arguments from the command line, it also provides help if the correct arguments aren’t provided. The script can now be run using:

python s2_FDI_Calculation.py S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R.nc S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R_fdi.nc

Providing a different input and output file will allow running for different files. However, an improvement would be to allow the script to take multiple input files and run for all of them, rather than having to run the script once per file.

4. Tidy up script to use functions

To make it easier to run for multiple files, the code which calculates FDI will be split out into a separate function:

def calculate_fdi_from_netcdf(input_netcdf, output_netcdf):
    """
    Function to calculate FDI from a netCDF file
    """
    # Specify chunksize so uses dask and doesn't load all data to RAM
    s2_data = xarray.open_dataset(input_netcdf, decode_cf=False, chunks=512)
    .
    .
    .
    # Export
    print("Writing out")
    with ProgressBar():
        fdi_xarray.to_netcdf(output_netcdf, engine="netcdf4", encoding={"fdi" : {"zlib" : True, "complevel" : 4}})

The argparse lines are then moved to the bottom of the script:

if __name__ == "__main__":

    parser = argparse.ArgumentParser(description="Calculate Floating Debris Index (FDI) Biermann "
                                                    " to a S2 scene")
    parser.add_argument("innetcdf", help="Input netCDF file")
    parser.add_argument("outnetcdf", help="Output netCDF file")
    args = parser.parse_args()

    calculate_fdi_from_netcdf(args.innetcdf, args.outnetcdf)

This will only run if the file is run as a script, allowing the function to be imported and used in another Python script. For more details see here.
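For example (assuming the script has been saved as s2_FDI_Calculation.py and is importable from the current directory), the function could be reused from another script or notebook without triggering the command line parsing:

# Hypothetical usage from another script: importing does not run the argparse
# block because it is guarded by if __name__ == "__main__".
from s2_FDI_Calculation import calculate_fdi_from_netcdf

calculate_fdi_from_netcdf("S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R.nc",
                          "S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R_fdi.nc")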

As part of the tidying up process comments with cell numbers from the notebook (e.g., # In[3]:) can be removed.

This will run the same as before, with a single input and output file provided. The next step is to take a number of input files and an output directory, and run for all of them:

import os

if __name__ == "__main__":

    parser = argparse.ArgumentParser(description="Calculate Floating Debris Index (FDI) Biermann "
                                                    " to a S2 scene")
    parser.add_argument("innetcdfs", nargs="+", help="Input netCDF file(s)")
    parser.add_argument("-o", "--outdir", help="Output directory", required=True)
    args = parser.parse_args()

    for in_netcdf in args.innetcdfs:
        print(f"Calculating FDI for {in_netcdf}...")
        # Set up output name for FDI netCDF
        out_netcdf_basename = os.path.basename(in_netcdf).replace(".nc", "_fdi.nc")
        out_netcdf = os.path.join(args.outdir, out_netcdf_basename)
        # Calculate FDI
        calculate_fdi_from_netcdf(in_netcdf, out_netcdf)
        print(f"Saved to {out_netcdf}")

This time, rather than specifying an output netCDF, an output directory is used (specified with -o or --outdir when running the script) and the os module is used to build the output path from the input file name.

The script can now be run using:

python s2_FDI_Calculation.py S2A_MSI_2018_04_20_11_21_21_T30VWH_L2R.nc S2A_MSI_2018_06_07_08_56_01_T35SMD_L2R.nc -o .

This calculates FDI for two netCDF files and saves the outputs to the current directory (specified with .). On macOS and Linux you can use globs to match a list of files rather than listing each one, for example:

python s2_FDI_Calculation.py S2A_MSI_*_L2R.nc -o .
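On Windows, where the shell doesn’t expand wildcards, one option (a sketch, not part of the finished script linked below) is to expand the patterns inside the script using Python’s glob module:

import glob

# Hypothetical addition: expand wildcard patterns passed on the command line
# (e.g. the list held in args.innetcdfs), so the script also works on shells
# which don't expand globs themselves.
patterns = ["S2A_MSI_*_L2R.nc"]
in_netcdfs = []
for pattern in patterns:
    in_netcdfs.extend(sorted(glob.glob(pattern)))

print(in_netcdfs)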

The original notebook and finished Python script are available from: https://github.com/NEODAAS/example_code/tree/main/jupyter-notebook-to-script

A script to find and download GEDI passes

The Global Ecosystem Dynamics Investigation (GEDI) is a spaceborne lidar instrument mounted on the International Space Station (ISS). The GEDI instrument is a geodetic-class lidar with 3 lasers that produce 8 parallel tracks. Each laser illuminates a 25 m footprint on the ground and fires at a rate of 242 times per second. Each footprint is separated by 60 m in the along track direction, at an across-track distance of 600 m between each of the 8 tracks. GEDI’s precise measurements of forest canopy height, canopy vertical structure, and surface elevation will be critical to characterizing and understanding carbon and water cycling processes to further our knowledge of the world we live in: http://www.gedi.umd.edu.

GEDI is mounted on the ISS and therefore its orbit is dictated by that of the ISS. This prevents repeat acquisitions in the way that Landsat or Sentinel satellite data are collected: while this enables wider coverage, data acquisition is not consistent. Further to this, GEDI has the ability to point its lasers, so the area imaged on the ground is not necessarily at nadir beneath the sensor.

To date, GEDI data is available via two means:

Earthdata (https://search.earthdata.nasa.gov/search) provides the data but does not currently provide sufficient visualization so that users can see whether the data intersects their ROI.

Alternatively, the NASA LP DAAC provide the data in list/ftp format (https://e4ftl01.cr.usgs.gov/GEDI/). This allows easy access to all the data but requires that the user knows the orbit that they require. Since February 2020, this has been supplemented by a web tool called GEDI Finder (https://lpdaacsvc.cr.usgs.gov/services/gedifinder). Users pass in bounding box dimensions alongside the GEDI product and version they require and it returns a subset of the list of data that intersects their bounding box. While this is a simple method for accessing data, it requires that a user either manually downloads each one or passes each name to software (wget, curl) to pull the data.

SearchPullGEDI.py was written to make the best of these existing tools and automate the whole search and download process. It relies heavily on the GEDI Finder tool to search for the data that intersect a bounding box and is, in its simplest form, a wrapper for this tool. It also allows an optional date range to be specified if users are only interested in data collected during a specific time period. SearchPullGEDI.py requires a number of command line options, including the GEDI product, version, bounding box, output path and Earthdata login credentials. An example of running SearchPullGEDI.py is:

python SearchPullGedi.py -p GEDI02_B -v 001 \
-bb 0.3714633 9.277247 -0.08164874 10.00922 \
-d 2019-01-01 2019-04-30 \
-o /Users/Me/Data/Gedi \
-u MyEarthDataUsername -pw MyEarthDataPassword

whereby:

-p is the GEDI product (e.g., GEDI02_B)

-v is the GEDI version (e.g., 001)

-bb is the bounding box given as UpperLeftLat (maxY), UpperLeftLon (minX), LowerRightLat (minY) and LowerRightLon (maxX). These should be passed to the terminal separated by spaces.

-d is the date range specified in the format YYYY-MM-DD, with the start and end date separated by a space

-o is the local path where you want to download the data

-u is your EarthData login username (Note an EarthData account is required)

-pw is your EarthData login password

Running:

python SearchPullGEDI.py -h

will also provide this information.

SearchPullGEDI.py contains three functions (a rough sketch of the overall workflow is shown after this list):

  • The first constructs the query from the command line options, passes it to the existing GEDI Finder tool and pulls the list of GEDI files from the returned HTML page.
  • The second function parses the search results pulled from the GEDI Finder URL and constructs a list where each element is a separate GEDI H5 file.
  • The final function iterates over the list of H5 files and pulls each one using wget software.
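As an illustration of this workflow, below is a minimal sketch (not the actual SearchPullGEDI.py code). The GEDI Finder query parameters and the format of its response are assumptions based on the description above, and Earthdata credentials are assumed to be configured for wget (for example in a ~/.netrc file):

import re
import subprocess
import urllib.request

# Hypothetical query string: the parameter names and bbox format should be
# checked against the GEDI Finder documentation / SearchPullGEDI.py itself.
url = ("https://lpdaacsvc.cr.usgs.gov/services/gedifinder"
       "?product=GEDI02_B&version=001"
       "&bbox=0.3714633,9.277247,-0.08164874,10.00922")

# 1. Query the GEDI Finder service and read the response.
with urllib.request.urlopen(url) as response:
    page = response.read().decode("utf-8")

# 2. Parse the response into a list of GEDI .h5 granule URLs.
h5_urls = re.findall(r"https?://\S+?\.h5", page)
print(f"Found {len(h5_urls)} files")

# 3. Download each granule with wget (assumes wget is installed and Earthdata
#    credentials are set up, e.g. in ~/.netrc).
for h5_url in h5_urls:
    subprocess.run(["wget", "-P", "/Users/Me/Data/Gedi", h5_url], check=True)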

SearchPullGEDI.py is available here: https://bitbucket.org/nathanmthomas/bucket-of-rs-and-gis-scripts/src/master/SearchPullGEDI.py

SearchPullGEDI.py requires that wget is installed. On Linux this can be installed through the package manager; on macOS you can install it from conda-forge using:

conda create -n gedi -c conda-forge python wget

The command will print the number of files found and then start downloading them. Each file is approximately 1 GB.

Author: Nathan Thomas (@DrNASApants)

Nathan is a UMD Earth System Science Interdisciplinary Center (ESSIC) PostDoc positioned at the NASA Goddard Space Flight Center. Nathan’s research is focused primarily around land cover mapping and characterizing the above ground structure of vegetation, particularly mangrove forests. Through this he uses Python to pull, preprocess, calibrate, analyze and display remote sensing data. Some of this code is distributed through his bitbucket: https://bitbucket.org/nathanmthomas/bucket-of-rs-and-gis-scripts/src/master/

Converting NEODAAS Mercator Projection netCDF files to GeoTiffs for use in QGIS / ArcMap

NetCDF files are a common format for distributing Earth Observation data and allow a number of variables to be stored alongside metadata. However, using netCDF files in a GIS is not always as easy as it could be.

The NERC Earth Observation Data Acquisition and Analysis Service (NEODAAS) routinely produces products such as chlorophyll from EO data and stores them as netCDF files. For the UK these use a Mercator projection, with the latitude and longitude of each pixel stored in separate arrays within the netCDF file. Unfortunately QGIS and ArcMap are often unable to read this information, so they don’t place the data in the correct location, making it difficult to use with other datasets.

To read data into the correct location I wrote a script which converts the latitude and longitude values in the netCDF file into tie points and then uses these to warp a GeoTiff into the correct location. It exports a single variable at a time.

To use for creating a GeoTiff from Chlorophyll data:

python reproject_neodaas_netcdf.py \
sentinel3a_olci_all_products_L3-median_uk_7d_20180720_20180726.nc \
sentinel3a_olci_all_products_L3-median_uk_7d_20180720_20180726_chl.tif \
--variable CHL_OC4ME

The script is below. It requires the GDAL and netCDF Python libraries.

#!/usr/bin/env python
"""
Create a single layer tif in WGS84 projection from
NEODAAS (https://www.neodaas.ac.uk/) netCDF files in mercator projection.
Converts the location array to GCPs and uses these to warp the image with
GDAL.
Dan Clewley (dac@pml.ac.uk).
2018-11-12
"""
import argparse
import os
import shutil
import tempfile
import netCDF4
from osgeo import gdal
# The geolocation grid in the netCDF file contains a location for
# each pixel. This is often more than GDAL will handle as GCPs so set
# a resampling factor to reduce.
GEOLOCATION_GRID_RESAMPLING = 50
def create_single_var_tiff(input_netcdf, output_tif, variable):
    """
    Export netCDF as a single variable GeoTiff
    """
    # Create single variable tif from input netCDF
    in_ds = gdal.Open('NETCDF:"{}":{}'.format(input_netcdf, variable), gdal.GA_ReadOnly)
    gdal.Translate(output_tif, in_ds)
    in_ds = None

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Reproject NetCDF")
    parser.add_argument("innetcdf", nargs=1, type=str,
                        help="Input netCDF file")
    parser.add_argument("outfile", nargs=1, type=str,
                        help="Output file (.tif)")
    parser.add_argument("--variable", required=True,
                        help="Variable from netCDF to extract (e.g., CHL_OC4ME)")
    args = parser.parse_args()
    # Make temporary directory
    temp_dir = tempfile.mkdtemp(prefix="netcdf_convert_")
    temp_tif = os.path.join(temp_dir, os.path.basename(args.innetcdf[0]))
    # Export a single layer as a geotiff
    create_single_var_tiff(args.innetcdf[0], temp_tif, args.variable)
    # Open new geotiff to add GCPs to
    data = gdal.Open(temp_tif, gdal.GA_Update)
    # Open input netCDF
    data_nc = netCDF4.Dataset(args.innetcdf[0])
    gcp_list = []
    # Go through longitude and latitude arrays and convert to GCPs
    for x in range(1, data_nc.variables['longitude'].size, GEOLOCATION_GRID_RESAMPLING):
        for y in range(1, data_nc.variables['latitude'].size, GEOLOCATION_GRID_RESAMPLING):
            gcp_list.append(gdal.GCP(float(data_nc.variables['longitude'][x]),
                                     float(data_nc.variables['latitude'][y]), 0,
                                     x, y))
    # Add GCPs to tif. Set projection is None to ignore it
    data.SetGCPs(gcp_list, None)
    # Warp file based on GCPs
    gdal.Warp(args.outfile[0], data, dstSRS='EPSG:4326', format='GTiff')
    # Close files
    data = None
    data_nc = None
    # Remove temporary directory
    shutil.rmtree(temp_dir)

Note that although the files should line up with other datasets, you are warping the data, which may introduce some errors. For best results, if you have a research project funded by NERC or are eligible for a NERC research grant or training award, you can contact NEODAAS to discuss specific processing requirements.

Working with full waveform LiDAR data in SPDLib (part 2)

The first post of this series was written quite a while ago now; apologies it has taken so long for a follow up. Since the first post was written there have been two exciting developments:

  1. The methods described for generating full waveform metrics were used to perform the LiDAR analysis for a paper led by Chloe Brown, University of Nottingham: Brown, C.; Boyd, D.S.; Sjögersten, S.; Clewley, D.; Evers, S.L.; Aplin, P. Tropical Peatland Vegetation Structure and Biomass: Optimal Exploitation of Airborne Laser Scanning. Remote Sens. 2018, 10, 671. https://doi.org/10.3390/rs10050671
  2. SPDLib is now available on Windows, macOS and Linux through conda-forge (as is RSGISLib). See part one for updated install instructions.

At the end of part one we had imported the LAS 1.3 file into SPDLib and decomposed the waveforms. This next section will cover ground classification and metrics generation.

  1. Spatially index data

    In part one we had been working with an SPD file without a spatial index (UPD file). However, for subsequent processing steps a spatial index is needed, so a spatially indexed file is generated using the spdtranslate command.

    As the gridding can use a lot of RAM we are going to process in tiles and then stitch them together. We create a temporary directory to store the tiles using:

    mkdir spd_tmp
    

    Then run the translate command:

    spdtranslate --if SPD --of SPD \
                 -x LAST_RETURN \
                 -b 1 \
                 --temppath spd_tmp \
                 -i LDR-FW-RG13_06-2014-303-05_subset_decomp.spd \
                 -o LDR-FW-RG13_06-2014-303-05_subset_decomp_gridded.spd
    

    The temporary directory can be removed after processing has completed:

    rm -fr spd_tmp
    
  2. Classify ground returns and populate height

    Many metrics use the height above ground rather than absolute elevation, so this must be defined. To derive heights from LiDAR data it is first necessary to determine the ground elevation so heights can be calculated above it. Within SPDLib ground classification is achieved using a combination of two algorithms: a Progressive Morphology Filter (PMF; [1]) followed by the Multi-Scale Curvature algorithm (MCC; [2]). Both these algorithms use only the discrete points rather than the waveform information.

    Apply a Progressive Morphology Filter using the following command:

    spdpmfgrd --grd 1 -i LDR-FW-RG13_06-2014-303-05_subset_decomp_gridded.spd \
              -o LDR-FW-RG13_06-2014-303-05_subset_decomp_gridded_pmf_grd.spd
    

    Then apply the Multi-Scale Curvature algorithm to the output file using:

    spdmccgrd --class 3 --initcurvetol 1 \
              -i LDR-FW-RG13_06-2014-303-05_subset_decomp_gridded_pmf_grd.spd \
              -o LDR-FW-RG13_06-2014-303-05_subset_decomp_gridded_pmf_mcc_grd.spd
    
  3. Attribute with height

    The final step of the SPD processing is to attribute each pulse with heights above ground level. An interpolation is used for ground points, similar to generating a Digital Terrain Model (DTM), but rather than using a regular grid the ground height is calculated for the position of each point.

    spddefheight --interp --in NATURAL_NEIGHBOR_CGAL \
                 -i LDR-FW-RG13_06-2014-303-05_subset_decomp_gridded_pmf_mcc_grd.spd \
                 -o LDR-FW-RG13_06-2014-303-05_subset_decomp_gridded_pmf_mcc_grd_defheight.spd
    
  4. Calculate metrics

    After all the pre-processing steps to convert the LAS 1.3 file into a gridded SPD file with heights defined, it is possible to generate a number of metrics from the waveform data. The command to calculate metrics within SPDLib (`spdmetrics`) takes an XML file in which the metrics are defined. There are a large number of metrics available, as well as operators (addition, subtraction etc.) allowing existing metrics to be combined to implement new metrics. The full list of metrics is available in the SPDMetrics.xml file, distributed with the source of SPDLib. Most metrics have an option to specify the minimum number of returns (`minNumReturns`): setting this to 0 will use the waveform information to calculate the metric, while setting it to 1 (default) or above will use the discrete data. In this way full waveform and discrete metrics can be created at the same time. For this exercise we will be calculating the Height of Medium Energy (HOME) and waveform distance (WD); a detailed description of these metrics is given in [3].

    First, create a file containing these metrics. Create a text file called ‘spd_metrics.xml’ and paste the text below into it:

    
    <!-- SPDLib Metrics file -->
    <spdlib:metrics xmlns:spdlib="http://www.spdlib.org/xml/">
        <!-- HOME -->
        <spdlib:metric metric="home" field="HOME"/>
        <!-- WD -->
        <spdlib:metric metric="maxheight" field="WD" minNumReturns="0"/>
    </spdlib:metrics>
    

    If this doesn’t display, try copying from https://gist.github.com/danclewley/4eefda2200e7593f1e5e2aaa6bae2c03

    To calculate the metrics and produce an image as an output, run:

    spdmetrics --image -o LDR-FW-RG13_06-2014-303-05_subset_metrics.bsq \
               -f ENVI \
               -i LDR-FW-RG13_06-2014-303-05_subset_decomp_gridded_pmf_mcc_grd_defheight.spd \
               -m spd_metrics.xml
    

    Once the command has finished, open the metrics image using:

    tuiview LDR-FW-RG13_06-2014-303-05_subset_metrics.bsq
    

More metrics can be added to the ‘spd_metrics.xml’ file as needed; it is also possible to define new metrics using the operator tags.

This post was derived from the LiDAR practical given as part of the NERC-ARF workshop held at BAS, Cambridge in March 2018. If you have any questions about working with NERC-ARF data contact the NERC-ARF Data Analysis Node (NERC-ARF-DAN) see https://nerc-arf-dan.pml.ac.uk/ or follow us on twitter: @NERC_ARF_DAN.

[1] Keqi Zhang, Shu-Ching Chen, Whitman, D., Mei-Ling Shyu, Jianhua Yan, & Zhang, C. (2003). A progressive morphological filter for removing nonground measurements from airborne LIDAR data. IEEE Transactions on Geoscience and Remote Sensing, 41(4), 872–882. http://doi.org/10.1109/TGRS.2003.810682
[2] Evans, J.S., Hudak, A.T. (2007). A multiscale curvature algorithm for classifying discrete return lidar in forested environments. IEEE Transactions on Geoscience and Remote Sensing, 45(4), 1029–1038.
[3] Cao, L., Coops, N., Hermosilla, T., Innes, J., Dai, J., & She, G. (2014). Using Small-Footprint Discrete and Full-Waveform Airborne LiDAR Metrics to Estimate Total Biomass and Biomass Components in Subtropical Forests. Remote Sensing, 6(8), 7110–7135. http://doi.org/10.3390/rs6087110

Find scenes from an archive which contain a point

This is a quick way of locating which scenes, from an archive of data, contain a point you are interested in.

First make a list of all available scenes

ls data/S1/2017/??/??/*/*vv*img > all_scenes.txt

Then use gdalbuildvrt to make a VRT file containing all scenes as separate bands (this assumes each scene has only a single band).

gdalbuildvrt -input_file_list all_scenes.txt \
             -separate -o s1_all_scenes.vrt

Use gdallocationinfo with the ‘-lifonly’ flag to find which scenes the point we are interested in (1.5 N, 103 E) falls within, and redirect the output to a text file.

gdallocationinfo -lifonly -wgs84 s1_all_scenes.vrt \
                 103 1.5 > selected_scenes.txt

This only shows whether the point is within the bounding box of each scene. To find the scenes which actually have data at the point, use gdallocationinfo again, this time with the ‘-valonly’ flag, and save the values to a text file.

gdallocationinfo -valonly -wgs84 s1_all_scenes.vrt \
                  103 1.5 > selected_scenes_values.txt

The original list of files can then be subset using the values file with a bit of Python:

import numpy
# Read in data to numpy arrays
scenes = numpy.genfromtxt("selected_scenes.txt", dtype=str)
values = numpy.genfromtxt("selected_scenes_values.txt")

# Select only scenes where the value is not 0
scenes_data = scenes[values != 0]

# Save to text file
numpy.savetxt("selected_scenes_data.txt", scenes_data, fmt="%s")

If you have a very large archive of data and often need to find which scenes intersect a point, having a spatial database with the scene outlines would be a better approach. However, if it isn’t something you do often, this quick approach using only the GDAL utilities and a bit of Python is worth knowing.

Introduction to RSGISLib Training Course

A course introducing RSGISLib was recently given in Japan by Pete Bunting. Material from the course is available to download from the following links:

The course covers pre-processing of PALSAR SAR data followed by a demonstration of pixel-based and object-based classification. All the required data and scripts are included in the package.

If you have any questions on the course please use the RSGISLib support Google group: https://groups.google.com/forum/#!forum/rsgislib-support

Convert LaTeX to Word

A big problem with writing in LaTeX is collaborating with colleagues who don’t use it. One option is to generate a Word .docx version and use the comments and track changes features in Word / LibreOffice. This does require manually copying the changes back to LaTeX so isn’t quite as nice as using latexdiff (see earlier post) but is slightly easier than adding comments to a PDF.

The best program I’ve found for converting LaTeX to Word is the open source (GPL) command line tool, pandoc (http://pandoc.org/).

Basic usage is quite straightforward:

pandoc latex_document.tex -o latex_document_word_version.docx

The conversion isn’t perfect (figures and tables can get a bit mangled), but it does a good job with the text.

Pandoc can convert between many different formats, including from markdown and reStructuredText (commonly used for software documentation) so it is worth having installed.

Convert Sentinel-2 Data Using GDAL

The latest version of GDAL (2.1) has a driver to read Sentinel 2 data (see http://gdal.org/frmt_sentinel2.html). Like HDF files they are read as subdatasets. Running ‘gdalinfo’ on the zipped folder or the .xml file contained within the .SAFE directory will display all the subdatasets, as well as all the metadata (so quite a lot of information!).

As with HDF5 you can pass the subdataset names to gdalinfo to get more information or gdal_translate to extract them as a separate dataset.

To make it easier to extract all the subdatasets I wrote a script (extract_s2_data.py) which can be downloaded from the https://bitbucket.org/petebunting/rsgis_scripts repository.

The script will get a list of all subdatasets using GDAL:

from osgeo import gdal
dataset = gdal.Open('S2/S2.xml', gdal.GA_ReadOnly)
subdatasets = dataset.GetSubDatasets()
dataset = None

The ones to be extracted are for the 10, 20 and 60 m resolution band groups for each UTM zone (if the file crosses multiple zones).

For each subdataset the script generates an output name, replacing the EPSG code with the UTM zone and ‘:’ with ‘_’.

Then the gdal_translate command, called using subprocess, is used to create a new file for each subdataset. By default the output is KEA format.
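Put together, the approach looks roughly like the sketch below. This is a simplified illustration, not the full extract_s2_data.py script: the input path, the layout of the subdataset names and the output naming are assumptions.

import os
import subprocess
from osgeo import gdal

in_file = "S2/S2.xml"  # hypothetical input, as in the snippet above
out_dir = "."
out_format = "KEA"

# List the subdatasets (each is a (name, description) tuple)
dataset = gdal.Open(in_file, gdal.GA_ReadOnly)
subdatasets = dataset.GetSubDatasets()
dataset = None

for subdataset_name, description in subdatasets:
    # Build an output name from the end of the subdataset name (assumed to be
    # of the form '...:10m:EPSG_32630'), replacing ':' with '_'. The real
    # script builds a more descriptive name including the UTM zone.
    out_basename = "_".join(subdataset_name.split(":")[-2:]) + ".kea"
    out_file = os.path.join(out_dir, out_basename)
    # Convert with gdal_translate called via subprocess
    subprocess.run(["gdal_translate", "-of", out_format,
                    subdataset_name, out_file], check=True)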

To run extract_s2_data.py first install GDAL 2.1. The conda-forge channel has recent builds; to install them using conda:

conda create -n gdal2 -c conda-forge gdal
source activate gdal2

(If you are on Windows leave out ‘source’)

To extract all subdatasets from a zipped Sentinel 2 scene to the current directory you can then use:

extract_s2_data.py -o . \
S2A_OPER_PRD_MSIL1C_PDMC_20151201T144038_R010_V20151130T142545_20151130T142545.zip

The gdal_translate command used is printed to the screen.

The default output format is KEA; you can change this using the ‘--of’ flag. For example, to convert an unzipped scene to GeoTiff:

extract_s2_data.py -o . --of GTiff \
S2A_OPER_PRD_MSIL1C_PDMC_20151201T144038_R010_V20151130T142545_20151130T142545.SAFE

To get the extension for all supported drivers, and some creation options, the ‘get_gdal_drivers’ module from arsf_dem_scripts is optionally used. You can just download this file and copy it into the same directory ‘extract_s2_data.py’ has been saved to. For Linux or OS X you can run:

# OS X
curl https://raw.githubusercontent.com/pmlrsg/arsf_dem_scripts/master/arsf_dem/get_gdal_drivers.py > get_gdal_drivers.py

# Linux
wget https://raw.githubusercontent.com/pmlrsg/arsf_dem_scripts/master/arsf_dem/get_gdal_drivers.py

Working with full waveform LiDAR data in SPDLib (part 1)

The SPD file format was designed around storing LiDAR pulses with digitised waveforms and associated points. The most recent version (3.3) has the ability to import waveform data from LAS files using LASlib, which is part of LAStools. Binaries are available for Linux, macOS and Windows and can be installed through conda:

conda create -n spdlib -c conda-forge spdlib tuiview
. activate spdlib
conda update -c conda-forge --all

For this example LAS 1.3 files acquired by the NERC Airborne Research Facility (NERC-ARF, previously ARSF) over Borneo using a Leica ALS50-II instrument with full waveform digitiser will be used.

Once you have registered with CEDA and applied for access to the ARSF archive these data can be downloaded from: http://browse.ceda.ac.uk/browse/neodc/arsf/2014/RG13_06/RG13_06-2014_303_Maliau_Basin/LiDAR/flightlines/fw_laser/las1.3

You can also follow through with any of the other NERC-ARF datasets or other waveform LAS files you have.

  1. Convert LAS to SPD format
    First you need a WKT file to define the projection. This step is optional but is more reliable than reading the projection from the LAS file. For this example the projection is UTM50N; you can download a WKT file using:

    wget https://bitbucket.org/petebunting/rsgis_scripts/raw/c8cf94528cdb58b753029df3bc631a2509740ad1/WKT/UTM_WGS84/UTM_WGS84_Z50_N.wkt
    

    Then convert to an unsorted SPD file (UPD).

    spdtranslate --if LAS --of UPD \
                 -x LAST_RETURN \
                 --input_proj UTM_WGS84_Z50_N.wkt \
                 -i LDR-FW-RG13_06-2014-303-05.LAS \
                 -o LDR-FW-RG13_06-2014-303-05.spd
    
  2. Subset SPD file (optional)

    As full waveform processing is quite intensive it is recommended to subset the data for the purpose of running through this tutorial; you can do this using the spdsubset command.

    spdsubset --xmin 494400 --ymin 524800 \
              --xmax 494800 --ymax 525000 \
              -i LDR-FW-RG13_06-2014-303-05.spd \
              -o LDR-FW-RG13_06-2014-303-05_subset.spd
    
  3. Decompose waveform

    One of the limitations of discrete return systems is that only a given number of ‘points’ are recorded (normally 2 – 4) and the rest of the information is lost. As full waveform data record the entire waveform, it is possible to extract more returns after the data are acquired. A common approach to this is ‘Gaussian Decomposition’, which involves fitting Gaussian distributions to the peaks; within SPDLib this is available as the ‘spddecomp’ command.

    spddecomp --all --noise --threshold 25 \
              -i  LDR-FW-RG13_06-2014-303-05_subset.spd \
              -o  LDR-FW-RG13_06-2014-303-05_subset_decomp.spd
    

    This will still take around 5 minutes to run. If you decide to decompose the full dataset afterwards, expect it to take an hour or so.

  4. Export returns to LAS file
    The final step for this part of the tutorial is to export the returns to a LAS file, using the spdtranslate command.

    spdtranslate --if SPD --of LAS \
                 -i LDR-FW-RG13_06-2014-303-05_subset_decomp.spd \
                 -o LDR-FW-RG13_06-2014-303-05_subset_decomp.las
    

Classifying ground returns and calculating LiDAR metrics using SPDLib are covered in part two.

If you have further questions about using SPDLib please contact the mailing list (details available from https://lists.sourceforge.net/lists/listinfo/spdlib-develop). If you have any questions about working with NERC-ARF data contact the NERC-ARF Data Analysis Node (NERC-ARF-DAN); see https://nerc-arf-dan.pml.ac.uk/ or follow us on twitter: @NERC_ARF_DAN.

Create a CSV with the coordinates from geotagged photos

Recently I took a lot of photos on my phone during fieldwork (survey for the NERC-ARF LiDAR and hyperspectral calibration flights) and wanted to extract the GPS coordinates from each photo so I could load them into QGIS using the add delimited text dialogue. The aim wasn’t to have precise locations for each photo (GPS positions for each of the points surveyed were recorded separately) but to give a quick idea of the locations we’d visited before the main GPS data were processed.

A while ago (2012!) I wrote a script for this task. The script (CreateJPEGKMZ) requires pillow and the imagemagick command line tools.

The code isn’t particularly tidy (despite my recent updates) but it basically pulls the GPS location out of the EXIF tags and writes it to a CSV file. To create a KMZ, a thumbnail image is created along with a corresponding KML file; both are then zipped together to generate the KMZ, which can be opened in GoogleEarth.
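As an illustration of the EXIF part, below is a minimal standalone sketch (not CreateJPEGKMZ itself) which pulls the GPS position from each photo and writes a CSV. It assumes a reasonably recent pillow, where the GPS values behave as rational numbers, and skips the error handling a real script would need:

import csv
import sys

from PIL import Image
from PIL.ExifTags import GPSTAGS, TAGS


def get_lat_lon(jpeg_path):
    """Return (latitude, longitude) in decimal degrees from a geotagged JPEG."""
    exif = Image.open(jpeg_path)._getexif()
    # Find the GPSInfo tag and decode its sub-tags to readable names
    gps_info = {}
    for tag_id, value in exif.items():
        if TAGS.get(tag_id) == "GPSInfo":
            gps_info = {GPSTAGS.get(k, k): v for k, v in value.items()}

    def to_decimal(dms, ref):
        # dms is (degrees, minutes, seconds); S and W are negative
        degrees, minutes, seconds = [float(v) for v in dms]
        decimal = degrees + minutes / 60.0 + seconds / 3600.0
        return -decimal if ref in ("S", "W") else decimal

    lat = to_decimal(gps_info["GPSLatitude"], gps_info["GPSLatitudeRef"])
    lon = to_decimal(gps_info["GPSLongitude"], gps_info["GPSLongitudeRef"])
    return lat, lon


if __name__ == "__main__":
    # Hypothetical usage: python photo_gps_to_csv.py out.csv photo1.jpg photo2.jpg
    with open(sys.argv[1], "w", newline="") as out_csv:
        writer = csv.writer(out_csv)
        writer.writerow(["photo", "latitude", "longitude"])
        for jpeg_path in sys.argv[2:]:
            lat, lon = get_lat_lon(jpeg_path)
            writer.writerow([jpeg_path, lat, lon])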

To use CreateJPEGKMZ.py to write a CSV file:

CreateJPEGKMZ.py --outcsv gps_points.csv input_jpeg_files

To also create KMZ files for each photo:

CreateJPEGKMZ.py --outcsv gps_points.csv \
                 --outkmz output_kmz_files \
                 input_jpeg_files

This was also a good lesson in the benefits of having scripts in version control and publicly available.