In our recent paper on an open source system for object based image classification [1] we mentioned linking with scikit-learn [2] to apply different classification algorithms. This post presents an example using Random Forests to give an idea of all the steps required. Random Forests is an ensamble learning algorithm which utilises many decision trees, each of which ‘vote’ for the final class. To reduce correlation between trees only a subset of predictors (data layers) and training samples are used for each class. Random forests is increasing in popularity within remote sensing, an example of usage is the pixel-based classification of Alaska by Whitcombe et al 2009 [3]
Taking [3] as an example this post demonstrates how the classification can be applied at the object level.
- Install software
- Define training data
- Stack bands
- Image segmentation
- Attribute segments
- Perform classification
As detailed in previous posts the software required can be installed under Linux/OS X through conda using
conda install -c osgeo \ rsgislib rios tuiview scikit-learn
Random Forests works best with a large number of training samples. For this example we’re using the National Wetlands Inventory (NWI) which are available as an ESRI Shapefile. For each class an integer code (WETCLASS) has been assigned and stored within the attribute table. The polygons are converted to a raster, where the pixel ID is the class using:
gdal_rasterize -of KEA -ot Byte -a WETCLASS \ -tr 100 100 nwi_shp.shp nwi_raster.kea
To stack the bands (assumed to have the same projection and resolution) the stack bands command, within RSGISLib, is used.
#/usr/bin/env python import rsgislib from rsgislib import imageutils # Segmentation files palsarList = ['palsar_hh.kea', 'also_hv.kea'] # All files datalayersList = palsarList + \ ['palsar_hh_tex.kea', 'palsar_hv_tex.kea', 'elevation.kea', 'slope.kea'] # Band names bandNamesList = ['hh','hv', 'hh_tex','hv_tex' 'elevation', 'slope'] # Out file type and format gdalformat = 'KEA' dataType = rsgislib.TYPE_32FLOAT # Create stack for segmentation outputImage = 'palsar_hhhv.kea' imageutils.stackImageBands(palsarList, None, \ outputImage, None, 0, gdalformat, dataType) # Create stack of all data layers outputDataStack = 'classification_stack.kea' imageutils.stackImageBands(datalayersList, bandNamesList, \ outputDataStack, None, 0, gdalformat, dataType)
Two stacks are created, one for the classification, which contains all data layers, and a second for segmentation which contains only SAR data. As the training data raster is categorical it is kept as a separate layer, because a separate function is required to attribute the segments.
The following code is used for segmentation (for more detail see earlier post)
#/usr/bin/env python from rsgislib.segmentation import segutils inputImage = 'palsar_hhhv.kea' clumpsFile = 'palsar_hhhv_clumps_elim_final.kea' meanImage = 'palsar_hhhv_clumps_elim_final_mean.kea' # Run segmentation segutils.runShepherdSegmentation(inputImage, clumpsFile, meanImage, numClusters=100, minPxls=100)
The method is similar to that described in an earlier post, with the addition of a step to include categorical data.
#!/usr/bin/env python from rsgislib import rastergis from rsgislib.rastergis import ratutils from rsgislib import imageutils dataStack = 'classification_stack.kea' classFile = 'nwi_raster.kea' clumpsFile = 'palsar_hhhv_clumps_elim_final.kea' # Attribute segments with data ratutils.populateImageStats(dataStack, clumpsFile, calcMean=True) # Convert training data to RAT codeStats = list() codeStats.append(rastergis.BandAttStats(band=1, minField='wetCode')) rastergis.populateRATWithStats(classFile, classFile, codeStats) # Attribute segments with class rastergis.strClassMajority(clumpsFile, classFile, \ 'wetCode', 'wetCode', False)
Much of the classification code is sorting the data from the RAT so it can be input into scikit-learn.
#!/usr/bin/env python from rios import rat import osgeo.gdal as gdal import numpy as np from sklearn.ensemble import RandomForestClassifier # Open RAT inRatFile = 'palsar_hhhv_clumps_elim_final.kea' ratDataset = gdal.Open(inRatFile, gdal.GA_Update) # Set column names x_col_names = ['hh','hv', 'hh_tex','hv_tex' 'elevation', 'slope'] y_col_name = 'wetCode' # Set up list to hold data X = [] # Read in data from each column for colName in x_col_names: X.append(rat.readColumn(ratDataset, colName)) # Read in training data y = rat.readColumn(ratDataset, y_col_name) # Set NA values to 0 y = np.where(y == b'NA',0,y) y = y.astype(np.int16) X.append(y) X = np.array(X) X = X.transpose() # Remove rows with 0 (NA) for wetCode X_train = X[X[:,-1] != 0] # Remove non-finite values X_train = X_train[np.isfinite(X_train).all(axis=1)] # Split into variables (X) and class (y) y_train = X_train[:,-1] X_train = X_train[:,0:-1] # Train Random Forests clf = RandomForestClassifier(n_estimators=500, max_features=3, \ oob_score=True, n_jobs=6, verbose=2) clf.fit(X_train, y_train) # Set NaN values to 0 X = np.where(np.isfinite(X),X,0) # Apply classification predictClass = clf.predict(X[:,0:-1]) # Write out data to RAT rat.writeColumn(ratDataset, 'predictClass', predictClass) ratDataset = None
There are other algorithms in scikit-learn which can also be applied instead of Random Forests once the data is in the correct format. The big advantage of this system is the entire process can be applied within a single Python script so multiple algorithms / parameters can be easily tested and the performance evaluated.
References
[1] Clewley, D.; Bunting, P.; Shepherd, J.; Gillingham, S.; Flood, N.; Dymond, J.; Lucas, R.; Armston, J.; Moghaddam, M. A Python-Based Open Source System for Geographic Object-Based Image Analysis (GEOBIA) Utilizing Raster Attribute Tables. Remote Sensing 2014, 6, 6111-6135.
[2] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[3] Whitcomb, J., Moghaddam, M., McDonald, K., Podest, E., Kellndorfer, J., Wetlands Map of Alaska Using L-Band Radar Satellite Imagery, Canadian Journal of Remote Sensing, 2009, Vol. 35, pp. 54-72