Object-based classification using Random Forests

In our recent paper on an open source system for object-based image classification [1] we mentioned linking with scikit-learn [2] to apply different classification algorithms. This post presents an example using Random Forests to give an idea of all the steps required. Random Forests is an ensemble learning algorithm which builds many decision trees, each of which 'votes' for the final class. To reduce correlation between trees, each tree is trained on a bootstrap sample of the training data and only a random subset of the predictors (data layers) is considered at each split. Random Forests is increasing in popularity within remote sensing; an example of its use is the pixel-based classification of Alaska by Whitcomb et al. 2009 [3].
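
In scikit-learn these ideas map directly onto the parameters of RandomForestClassifier, which is used later in this post (the values below are purely illustrative):

    from sklearn.ensemble import RandomForestClassifier

    # n_estimators - number of trees, each trained on a bootstrap sample
    #                of the training data
    # max_features - number of predictors (data layers) considered at each split
    # The predicted class is the majority vote across all trees.
    clf = RandomForestClassifier(n_estimators=500, max_features=3)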

Taking [3] as an example, this post demonstrates how the classification can be applied at the object level.

  1. Install software

    As detailed in previous posts, the software required can be installed under Linux/OS X through conda using:

    conda install -c osgeo \
      rsgislib rios tuiview scikit-learn
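
    A quick way to confirm the install worked is to check the Python packages import cleanly (a simple sanity check, not part of the original workflow):

    python -c "import rsgislib, rios, sklearn; print('imports OK')"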
    
  2. Define training data

    Random Forests works best with a large number of training samples. For this example the National Wetlands Inventory (NWI) is used, which is available as an ESRI Shapefile. For each class an integer code (WETCLASS) has been assigned and stored within the attribute table. The polygons are converted to a raster, where the pixel value is the class code, using:

    gdal_rasterize -of KEA -ot Byte -a WETCLASS \
         -tr 100 100 nwi_shp.shp nwi_raster.kea
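
    To confirm the rasterised training data contains the expected class codes, the unique pixel values can be listed with the GDAL Python bindings (a quick check, assuming the raster is small enough to read into memory):

    #!/usr/bin/env python
    import numpy as np
    import osgeo.gdal as gdal

    # Read the rasterised training data and list the class codes present
    ds = gdal.Open('nwi_raster.kea')
    print(np.unique(ds.GetRasterBand(1).ReadAsArray()))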
    
  3. Stack bands

    To stack the bands (assumed to have the same projection and resolution) the stackImageBands function, within RSGISLib, is used.

    #!/usr/bin/env python
    import rsgislib
    from rsgislib import imageutils
    
    # Segmentation files
    palsarList = ['palsar_hh.kea',
                  'palsar_hv.kea']
    
    # All files
    datalayersList = palsarList + \
                     ['palsar_hh_tex.kea',
                      'palsar_hv_tex.kea',
                      'elevation.kea',
                      'slope.kea']
    
    # Band names
    bandNamesList = ['hh','hv',
                     'hh_tex','hv_tex',
                     'elevation', 'slope']
    
    # Out file type and format
    gdalformat = 'KEA'
    dataType = rsgislib.TYPE_32FLOAT
    
    # Create stack for segmentation
    outputImage = 'palsar_hhhv.kea'
    imageutils.stackImageBands(palsarList, None, \
         outputImage, None, 0, gdalformat, dataType)
    
    # Create stack of all data layers
    outputDataStack = 'classification_stack.kea'
    imageutils.stackImageBands(datalayersList, bandNamesList, \
         outputDataStack, None, 0, gdalformat, dataType)
    

    Two stacks are created: one for the classification, which contains all data layers, and a second for the segmentation, which contains only the SAR data. As the training data raster is categorical it is kept as a separate layer; a separate function is used to attribute the segments with it.
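
    To check the band names were written to the classification stack, the band descriptions can be printed with GDAL (assuming stackImageBands stores the names as band descriptions, which the KEA format supports):

    #!/usr/bin/env python
    import osgeo.gdal as gdal

    # Print the name stored against each band of the classification stack
    ds = gdal.Open('classification_stack.kea')
    for band in range(ds.RasterCount):
        print(ds.GetRasterBand(band + 1).GetDescription())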

  4. Image segmentation

    The following code is used for the segmentation (for more detail see an earlier post).

    #!/usr/bin/env python
    from rsgislib.segmentation import segutils
     
    inputImage = 'palsar_hhhv.kea'
    clumpsFile = 'palsar_hhhv_clumps_elim_final.kea'
    meanImage = 'palsar_hhhv_clumps_elim_final_mean.kea'
    
    # Run segmentation
    segutils.runShepherdSegmentation(inputImage, clumpsFile,
                        meanImage, numClusters=100, minPxls=100)
    

    The method is similar to that described in an earlier post, with the addition of a step to include categorical data.
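
    If the clumps file already has a Raster Attribute Table attached (KEA files written by RSGISLib normally do), the number of segments produced can be checked from the RAT row count:

    #!/usr/bin/env python
    import osgeo.gdal as gdal

    # Each row of the RAT corresponds to one segment (clump)
    ds = gdal.Open('palsar_hhhv_clumps_elim_final.kea')
    rat = ds.GetRasterBand(1).GetDefaultRAT()
    print(rat.GetRowCount())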

  5. Attribute segments

    #!/usr/bin/env python
    
    from rsgislib import rastergis
    from rsgislib.rastergis import ratutils
    from rsgislib import imageutils
     
    dataStack = 'classification_stack.kea'
    classFile = 'nwi_raster.kea'
    clumpsFile = 'palsar_hhhv_clumps_elim_final.kea'
     
    # Attribute segments with data
    ratutils.populateImageStats(dataStack, clumpsFile,
                        calcMean=True)
    
    # Convert training data to RAT
    codeStats = list()
    codeStats.append(rastergis.BandAttStats(band=1, minField='wetCode'))
    rastergis.populateRATWithStats(classFile, classFile, codeStats)
    
    # Attribute segments with class
    rastergis.strClassMajority(clumpsFile, classFile, \
                  'wetCode', 'wetCode', False)
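
    After attributing, the columns present in the RAT can be listed to confirm the statistics and the 'wetCode' class column were added (a quick check using the GDAL RAT API):

    #!/usr/bin/env python
    import osgeo.gdal as gdal

    # List the column names now present in the clumps file RAT
    ds = gdal.Open('palsar_hhhv_clumps_elim_final.kea')
    rat = ds.GetRasterBand(1).GetDefaultRAT()
    print([rat.GetNameOfCol(i) for i in range(rat.GetColumnCount())])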
    
  6. Perform classification

    Much of the classification code deals with getting the data out of the RAT and into arrays that can be passed to scikit-learn.

    #!/usr/bin/env python
    from rios import rat
    import osgeo.gdal as gdal
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    
    # Open RAT
    inRatFile = 'palsar_hhhv_clumps_elim_final.kea'
    ratDataset = gdal.Open(inRatFile, gdal.GA_Update)
    
    # Set column names
    x_col_names = ['hh','hv',
                   'hh_tex','hv_tex',
                   'elevation', 'slope']
    
    y_col_name = 'wetCode'
    
    # Set up list to hold data
    X = []
    
    # Read in data from each column
    for colName in x_col_names:
        X.append(rat.readColumn(ratDataset, colName))
    
    # Read in training data
    y = rat.readColumn(ratDataset, y_col_name) 
    # Set NA values to 0
    y = np.where(y == b'NA',0,y)
    y = y.astype(np.int16)
    
    X.append(y)
    
    X = np.array(X)
    X = X.transpose()
    
    # Remove rows with 0 (NA) for wetCode
    X_train = X[X[:,-1] != 0]
    
    # Remove non-finite values
    X_train = X_train[np.isfinite(X_train).all(axis=1)]
    
    # Split into variables (X) and class (y)
    y_train = X_train[:,-1]
    X_train = X_train[:,0:-1]
    
    # Train Random Forests
    clf = RandomForestClassifier(n_estimators=500, max_features=3, \
          oob_score=True, n_jobs=6, verbose=2)
    
    clf.fit(X_train, y_train)
    
    # Set non-finite values (NaN, inf) to 0
    X = np.where(np.isfinite(X),X,0)
    
    # Apply classification
    predictClass = clf.predict(X[:,0:-1])
    
    # Write out data to RAT
    rat.writeColumn(ratDataset, 'predictClass', predictClass)
    ratDataset = None
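
    Because oob_score=True was passed when training, an out-of-bag estimate of the classification accuracy is available after fitting, along with the relative importance of each predictor. The predicted class can also be written out as a standard raster; the sketch below uses rastergis.exportCol2GDALImage, but treat the output name and exact arguments as assumptions to check against your RSGISLib version.

    # Out-of-bag accuracy estimate (available because oob_score=True was set)
    print(clf.oob_score_)

    # Relative importance of each predictor
    print(dict(zip(x_col_names, clf.feature_importances_)))

    # Export the predicted class to a raster ('predict_class.kea' is a
    # hypothetical output name; check the exportCol2GDALImage arguments
    # for your RSGISLib version)
    import rsgislib
    from rsgislib import rastergis
    rastergis.exportCol2GDALImage(inRatFile, 'predict_class.kea', 'KEA',
                                  rsgislib.TYPE_32INT, 'predictClass')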
    

There are other algorithms in scikit-learn which can be applied instead of Random Forests once the data is in the correct format. The big advantage of this system is that the entire process can be run within a single Python script, so multiple algorithms and parameters can be easily tested and their performance evaluated.
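
For example, Extra Trees (another tree ensemble in scikit-learn) can be swapped in by changing a single line; the data preparation, prediction and writing of results are unchanged. A minimal sketch:

    from sklearn.ensemble import ExtraTreesClassifier

    # Drop-in replacement for RandomForestClassifier in the script above
    clf = ExtraTreesClassifier(n_estimators=500, max_features=3,
                               n_jobs=6, verbose=2)
    clf.fit(X_train, y_train)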

References

[1] Clewley, D.; Bunting, P.; Shepherd, J.; Gillingham, S.; Flood, N.; Dymond, J.; Lucas, R.; Armston, J.; Moghaddam, M. A Python-Based Open Source System for Geographic Object-Based Image Analysis (GEOBIA) Utilizing Raster Attribute Tables. Remote Sensing 2014, 6, 6111-6135.
[2] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[3] Whitcomb, J., Moghaddam, M., McDonald, K., Podest, E., Kellndorfer, J., Wetlands Map of Alaska Using L-Band Radar Satellite Imagery, Canadian Journal of Remote Sensing, 2009, Vol. 35, pp. 54-72.