In our recent paper on an open source system for object based image classification [1] we mentioned linking with scikit-learn [2] to apply different classification algorithms. This post presents an example using Random Forests to give an idea of all the steps required. Random Forests is an ensamble learning algorithm which utilises many decision trees, each of which ‘vote’ for the final class. To reduce correlation between trees only a subset of predictors (data layers) and training samples are used for each class. Random forests is increasing in popularity within remote sensing, an example of usage is the pixel-based classification of Alaska by Whitcombe et al 2009 [3]
Taking [3] as an example this post demonstrates how the classification can be applied at the object level.
- Install software
As detailed in previous posts the software required can be installed under Linux/OS X through conda using
conda install -c osgeo \
rsgislib rios tuiview scikit-learn
- Define training data
Random Forests works best with a large number of training samples. For this example we’re using the National Wetlands Inventory (NWI) which are available as an ESRI Shapefile. For each class an integer code (WETCLASS) has been assigned and stored within the attribute table. The polygons are converted to a raster, where the pixel ID is the class using:
gdal_rasterize -of KEA -ot Byte -a WETCLASS \
-tr 100 100 nwi_shp.shp nwi_raster.kea
- Stack bands
To stack the bands (assumed to have the same projection and resolution) the stack bands command, within RSGISLib, is used.
#/usr/bin/env python
import rsgislib
from rsgislib import imageutils
# Segmentation files
palsarList = ['palsar_hh.kea',
'also_hv.kea']
# All files
datalayersList = palsarList + \
['palsar_hh_tex.kea',
'palsar_hv_tex.kea',
'elevation.kea',
'slope.kea']
# Band names
bandNamesList = ['hh','hv',
'hh_tex','hv_tex'
'elevation', 'slope']
# Out file type and format
gdalformat = 'KEA'
dataType = rsgislib.TYPE_32FLOAT
# Create stack for segmentation
outputImage = 'palsar_hhhv.kea'
imageutils.stackImageBands(palsarList, None, \
outputImage, None, 0, gdalformat, dataType)
# Create stack of all data layers
outputDataStack = 'classification_stack.kea'
imageutils.stackImageBands(datalayersList, bandNamesList, \
outputDataStack, None, 0, gdalformat, dataType)
Two stacks are created, one for the classification, which contains all data layers, and a second for segmentation which contains only SAR data. As the training data raster is categorical it is kept as a separate layer, because a separate function is required to attribute the segments.
- Image segmentation
The following code is used for segmentation (for more detail see earlier post)
#/usr/bin/env python
from rsgislib.segmentation import segutils
inputImage = 'palsar_hhhv.kea'
clumpsFile = 'palsar_hhhv_clumps_elim_final.kea'
meanImage = 'palsar_hhhv_clumps_elim_final_mean.kea'
# Run segmentation
segutils.runShepherdSegmentation(inputImage, clumpsFile,
meanImage, numClusters=100, minPxls=100)
The method is similar to that described in an earlier post, with the addition of a step to include categorical data.
- Attribute segments
#!/usr/bin/env python
from rsgislib import rastergis
from rsgislib.rastergis import ratutils
from rsgislib import imageutils
dataStack = 'classification_stack.kea'
classFile = 'nwi_raster.kea'
clumpsFile = 'palsar_hhhv_clumps_elim_final.kea'
# Attribute segments with data
ratutils.populateImageStats(dataStack, clumpsFile,
calcMean=True)
# Convert training data to RAT
codeStats = list()
codeStats.append(rastergis.BandAttStats(band=1, minField='wetCode'))
rastergis.populateRATWithStats(classFile, classFile, codeStats)
# Attribute segments with class
rastergis.strClassMajority(clumpsFile, classFile, \
'wetCode', 'wetCode', False)
- Perform classification
Much of the classification code is sorting the data from the RAT so it can be input into scikit-learn.
#!/usr/bin/env python
from rios import rat
import osgeo.gdal as gdal
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Open RAT
inRatFile = 'palsar_hhhv_clumps_elim_final.kea'
ratDataset = gdal.Open(inRatFile, gdal.GA_Update)
# Set column names
x_col_names = ['hh','hv',
'hh_tex','hv_tex'
'elevation', 'slope']
y_col_name = 'wetCode'
# Set up list to hold data
X = []
# Read in data from each column
for colName in x_col_names:
X.append(rat.readColumn(ratDataset, colName))
# Read in training data
y = rat.readColumn(ratDataset, y_col_name)
# Set NA values to 0
y = np.where(y == b'NA',0,y)
y = y.astype(np.int16)
X.append(y)
X = np.array(X)
X = X.transpose()
# Remove rows with 0 (NA) for wetCode
X_train = X[X[:,-1] != 0]
# Remove non-finite values
X_train = X_train[np.isfinite(X_train).all(axis=1)]
# Split into variables (X) and class (y)
y_train = X_train[:,-1]
X_train = X_train[:,0:-1]
# Train Random Forests
clf = RandomForestClassifier(n_estimators=500, max_features=3, \
oob_score=True, n_jobs=6, verbose=2)
clf.fit(X_train, y_train)
# Set NaN values to 0
X = np.where(np.isfinite(X),X,0)
# Apply classification
predictClass = clf.predict(X[:,0:-1])
# Write out data to RAT
rat.writeColumn(ratDataset, 'predictClass', predictClass)
ratDataset = None
There are other algorithms in scikit-learn which can also be applied instead of Random Forests once the data is in the correct format. The big advantage of this system is the entire process can be applied within a single Python script so multiple algorithms / parameters can be easily tested and the performance evaluated.
References
[1] Clewley, D.; Bunting, P.; Shepherd, J.; Gillingham, S.; Flood, N.; Dymond, J.; Lucas, R.; Armston, J.; Moghaddam, M. A Python-Based Open Source System for Geographic Object-Based Image Analysis (GEOBIA) Utilizing Raster Attribute Tables. Remote Sensing 2014, 6, 6111-6135.
[2] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[3] Whitcomb, J., Moghaddam, M., McDonald, K., Podest, E., Kellndorfer, J., Wetlands Map of Alaska Using L-Band Radar Satellite Imagery, Canadian Journal of Remote Sensing, 2009, Vol. 35, pp. 54-72