Create Image Chips

Creating Image Chips

Introduction

When applying deep learning to geospatial data, you will need to generate image chips and associated labels of a defined size to train the algorithm. In the case of scene labeling, each of these chips will need to have an associated label. In the case of semantic segmentation, each chip will need to have an associated pixel-level mask where each class is assigned a unique numeric code. Multispectral imagery and other geospatial raster data commonly cover large spatial extents. So, you will need to be able to “chip” these data into smaller, consistently sized subsets so that they can be used to train, validate, and assess deep learning models. To make predictions over entire extents, you will then need to break the larger dataset into chips, predict each chip, then reassemble the predictions to obtain a single, continuous output, as we did in the prior section.

In this module, I will demonstrate functions that my lab group wrote in R, using the terra package, to prepare data as input to deep learning semantic segmentation workflows.

There are some tools available for creating image chips and associated labels or masks from geospatial data. In the ArcGIS Pro desktop software the Export Training Data for Deep Learning tool is available with a Spatial Analyst and/or Image Analyst license. Documentation is available here: https://pro.arcgis.com/en/pro-app/latest/tool-reference/image-analyst/export-training-data-for-deep-learning.htm. This tool can generate image chips for scene labeling, object detection, semantic segmentation, and instance segmentation. It is also available in the ArcGIS API for Python. If you do not have access to ArcGIS Pro and the required extensions, the Produce Training Data for Deep Learning plugin is available for QGIS (https://plugins.qgis.org/plugins/produce_training_data_for_deep_learning/) and is open-source.

In this example, I will make use of the topoDL dataset, which is made available on the West Virginia View website: http://www.wvview.org/research.html. These data were used in the following publication, which can be accessed for free:

Maxwell, A.E., M.S. Bester, L.A. Guillen, C.A. Ramezan, D.J. Carpinello, Y. Fan, F.M. Hartley, S.M. Maynard, and J.L. Pyron, 2020. Semantic segmentation deep learning for extracting surface mine extents from historic topographic maps, Remote Sensing, 12(24): 4145. https://doi.org/10.3390/rs12244145.

The goal of this study was to use semantic segmentation methods to extract the extent of historic surface mining from 1:24,000 scale topographic maps. Note that running the code below will take some time since I use the entire dataset of 122 topographic maps. If you would like to experiment with the code, you can run it for a single topographic map or for your own data. These data will be used in the next section, where we explore the Segmentation Models Python package, so you will need to execute this code and create these image chips and masks if you want to run the code in the next section.

library(dplyr)
library(stringr)
library(terra)
library(imager)

Create Masks

As part of this class, I have provided some R functions that my lab group has written to aid in generating deep learning datasets. These functions are included as .R files within the class data downloads. You can use these functions within R by executing the function code to instantiate the function then calling it with your specific data and settings. I recommend using the RStudio integrated development environment (IDE), which is free (https://posit.co/download/rstudio-desktop/).
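
As a minimal sketch of that workflow, assuming a function is stored in a standalone .R file, source() runs the file and makes the function available in the current session. The file name and the toy function below are hypothetical stand-ins, not the files from the class download.

```r
# Write a toy function definition to a temporary .R file
# (hypothetical stand-in for one of the provided function files)
fn_file <- file.path(tempdir(), "toyFunction.R")
writeLines("addOne <- function(x){ x + 1 }", fn_file)

# source() executes the file, which defines addOne() in the session
source(fn_file)

# The function can now be called like any other R function
result <- addOne(41)
print(result)
```

Opening the .R file in RStudio and running all of its lines has the same effect.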

The makeMasks() function can be used to generate raster-based masks from vector geospatial data for semantic segmentation tasks. It has the following parameters.

image: the image associated with the vector objects. The image will define the cell size, output coordinate reference system, and extent of the generated raster masks. Can be either a SpatRaster object or a string representing a file on disk.
features: geospatial vector data representing extents of features of interest. We generally suggest using the shapefile format. Can be either a SpatVector object or a string representing a file on disk.
extent: layer used to crop the extent of the generated raster masks and the image data (if they are exported). This should be in a vector data format. We recommend using a shapefile.
field: the name of the attribute column in the feature layer that specifies the unique code for each class. Classes should be differentiated using numeric codes as opposed to character strings. Codes should be assigned from 0 to the number of classes minus 1 without skipping values. If you want to reserve 0 for a background class, you should begin the codes at 1. In the function call, the field name must be quoted (e.g., “code”).
background: the numeric code to assign to the background class. The default is 0. This should be unique from the other class codes unless you want the background to be included within or merged with one of the defined classes. For a binary classification problem, the background class should be coded to 0 while the class of interest should be coded to 1.
crop: if TRUE, the provided vector extent will be used to crop the image and associated raster mask. If FALSE, the extent of the data will not be cropped.
outImage: name of the output image, including full file path and file extension. We recommend the .tif or .png format be used.
outMask: name of output raster mask, including full file path and file extension. We recommend the .tif or .png format be used.
mode: two modes are available. “Both” = export both a copy of the image and the raster mask. “Mask” = just export the mask. You can choose to export the image along with the raster mask to (1) copy the file or change the file format and/or (2) to crop the image relative to the defined extent. If the raster masks are cropped relative to the defined extent, you should also export the images so that the files have the same extent and number of rows and columns of pixels.
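
To illustrate how the field and background parameters determine the mask values, here is a small sketch that rasterizes one in-memory polygon onto a 10x10 grid with terra. The grid, the polygon, the coordinate reference system, and the “code” value of 1 are all made up for illustration; the real workflow reads the topographic maps and shapefiles from disk.

```r
library(terra)

# Toy 10 x 10 grid with 1-unit cells (stand-in for a topographic map)
r <- rast(nrows=10, ncols=10, xmin=0, xmax=10, ymin=0, ymax=10, crs="EPSG:32617")

# Toy polygon given as WKT (stand-in for a mapped feature of interest)
p <- vect("POLYGON ((2 2, 2 6, 6 6, 6 2, 2 2))", crs="EPSG:32617")

# Attribute column holding the numeric class code
p$code <- 1

# Cells whose centers fall inside the polygon receive the value in "code";
# all other cells receive the background value
m <- rasterize(p, r, field="code", background=0)

# Count the cells assigned to the feature class and to the background
n_feature <- sum(values(m) == 1)
n_background <- sum(values(m) == 0)
print(c(feature=n_feature, background=n_background))
```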

makeMasks <- function(image, features, crop=FALSE, extent, field, background=0, outImage, outMask, mode="Both"){
  require(terra)

  #Read the image, the features of interest, and the extent layer
  imgData <- rast(image)
  featData <- vect(features)
  extData <- vect(extent)

  #Reproject the vector layers to the coordinate reference system of the image
  extCRS <- project(extData, imgData)
  featCRS <- project(featData, imgData)

  #Optionally crop the image to the reprojected extent
  if(crop==TRUE){
    imgData <- crop(imgData, extCRS)
  }

  #Burn the class codes into a raster aligned with the (cropped) image
  maskR <- rasterize(featCRS, imgData, field=field, background=background)

  if(mode=="Both") {
    writeRaster(imgData, outImage)
    writeRaster(maskR, outMask)
  }else if(mode=="Mask") {
    writeRaster(maskR, outMask)
  }else{
    print("Invalid Mode.")
  }
}

Here I have provided an example of rasterizing surface disturbance extent boundaries associated with 1:24,000 scale topographic maps. In this case, I am exporting both the image and raster mask since the data are being cropped to an extent. This was done to remove the collar information from the topographic map. The “code” column specifies the numeric code associated with the class of the feature. Since there is only one class, a numeric code of 1 is used. The background, or non-surface disturbance, extents are coded to 0.

This process requires several steps. First, I list all of the topographic maps in the ky_topos folder. There are a total of 122 topographic maps in this folder, all occurring within the state of Kentucky. We have also provided topographic maps from Ohio and Virginia. However, I will not use those here. I next create a dataframe that stores (1) the name of each file, (2) the full file path to the topographic map, (3) the full file path to the vector mask data, (4) the full file path to the associated extent vector data, (5) the file path for the new output topographic map that has been cropped, and (6) the file path to the generated raster mask. I then use a for loop to iterate over all rows in the table and process each topographic map with the makeMasks() function. If you run the code, you can explore the resulting output on your local machine.
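
The path table described above can also be sketched with file.path() and tools::file_path_sans_ext(), which avoids hand-assembled separators. The root folder and the two file names below are hypothetical placeholders; the file.exists() check is an optional safeguard that I am assuming you may want before starting a long loop.

```r
# Hypothetical root folder and file names (stand-ins for the ky_topos listing)
root <- "C:/data/topo_dl_data"
images <- c("KY_Corbin_708433_1961_24000_geo.tif",
            "KY_Meta_709277_1978_24000_geo.tif")

# Drop the extension to obtain the base name shared by all related files
base <- tools::file_path_sans_ext(images)

# One row per topographic map, with input and output paths
paths <- data.frame(
  img      = base,
  img_path = file.path(root, "ky_topos", images),
  msk_path = file.path(root, "ky_mines", paste0(base, ".shp")),
  ext_path = file.path(root, "ky_quads", paste0(base, ".shp")),
  img_out  = file.path(root, "processing", "img", paste0(base, ".tif")),
  msk_out  = file.path(root, "processing", "msk", paste0(base, ".tif"))
)

# Flag rows whose three input files are all present on disk
paths$inputs_ok <- file.exists(paths$img_path) &
  file.exists(paths$msk_path) & file.exists(paths$ext_path)
print(paths[, c("img", "inputs_ok")])
```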

images <- list.files("C:/myFiles/work/topo_dl_data/topo_dl_data/ky_topos",
                     full.names= FALSE,
                     pattern="\\.tif$")

imgDF <- data.frame(img = substr(images, 1, nchar(images)-4))
imgDF$img_path <- paste0("C:/myFiles/work/topo_dl_data/topo_dl_data/ky_topos/", imgDF$img, ".tif")
imgDF$msk_path <- paste0("C:/myFiles/work/topo_dl_data/topo_dl_data/ky_mines/", imgDF$img, ".shp")
imgDF$ext_path <- paste0("C:/myFiles/work/topo_dl_data/topo_dl_data/ky_quads/", imgDF$img, ".shp")
imgDF$img_out <- paste0("C:/myFiles/work/topo_dl_data/topo_dl_data/processing/", "img/", imgDF$img, ".tif")
imgDF$msk_out <- paste0("C:/myFiles/work/topo_dl_data/topo_dl_data/processing/", "msk/", imgDF$img, ".tif")

for(i in 1:nrow(imgDF)) {
  makeMasks(image=imgDF[i, "img_path"], features=imgDF[i, "msk_path"],
            crop=TRUE, extent = imgDF[i, "ext_path"], field="code", background=0,
            outImage=imgDF[i, "img_out"], outMask=imgDF[i, "msk_out"],
            mode="Both")
}

Create Chips

The makeChips() function allows you to generate image chips of a defined size from an input image and associated mask. This function expects the image and mask to have the same spatial extent and the same number of rows and columns. Also, if a chip contains NoData or Null pixels, it is not exported. In other words, only complete chips are exported. All chips are saved to PNG format and will not have associated spatial reference information since this is not required to train the deep learning algorithm. This function has the following parameters:

image: input image. This should be exported from makeMasks() or have the same spatial extent and number of rows and columns as the associated raster mask.
mask: raster mask produced by the makeMasks() function. You can use images and raster masks that were not created using the makeMasks() function as long as each image and raster mask has the same spatial extent and the same number of rows and columns of pixels. Codes should be assigned from 0 to the number of classes minus 1 without skipping values. If you want to reserve 0 for a background class, you should begin the codes at 1.
n_channels: number of channels or layers in the input image. The default is 3.
size: size of the image chips specified as number of pixels. The default is 256, which indicates 256x256 pixels.
stride_x: the stride in the x dimension (side-to-side). If this is the same as the size parameter, there will be no overlap between image chips in the side-to-side dimension.
stride_y: the stride in the y dimension (up-and-down). If this is the same as the size parameter, there will be no overlap between image chips in the up-and-down direction.
outDir: output directory. The function will create “images” and “masks” subfolders in this directory. You should not include the final “/” in the file path.
mode: three modes are available. “All” = write out all image and mask chips even if the entire image chip consists of the background class. “Positive” = write out only image chips and associated masks if at least one pixel is not mapped to the background class. “Divided” = write out all chips but separate the background-only examples from the other chips. If this mode is used, “positive” and “background” subfolders will be created inside of the “images” and “masks” directories.
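
To make the interaction between size and stride concrete, the sketch below computes the upper-left pixel index of each chip along one dimension. It mirrors my reading of the indexing in makeChips(): chips start every stride pixels, and a chip that would run past the edge is shifted back so that it ends on the last pixel. The helper name chip_starts() and the 1000 x 600 image dimensions are hypothetical.

```r
# Upper-left index of each chip along one image dimension
chip_starts <- function(n_pixels, size = 256, stride = 256) {
  stopifnot(n_pixels >= size)
  n_steps <- ceiling(n_pixels / stride)
  starts <- (seq(0, n_steps - 1) * stride) + 1
  # Shift any chip that would overrun the edge back inside the image
  overshoot <- (starts + size - 1) > n_pixels
  starts[overshoot] <- n_pixels - (size - 1)
  starts
}

# Hypothetical image with 1000 columns and 600 rows
cols <- chip_starts(1000, size = 256, stride = 256)
rows <- chip_starts(600, size = 256, stride = 256)
print(cols)
print(rows)

# One chip is produced per row/column combination
n_chips <- length(cols) * length(rows)
print(n_chips)
```

With a stride equal to the chip size, only the last chip in each direction overlaps its neighbor. With a stride smaller than the chip size, several trailing chips can be shifted back to the same position, so a stride that divides the image dimensions sensibly is worth choosing.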

makeChips <- function(image, mask, n_channels=3, size=256, stride_x=256, stride_y=256, outDir, mode="All"){
  require(terra)
  require(imager)
  if(mode == "All"){
    img1 <- rast(image)
    mask1 <- rast(mask)

    fName = basename(image)

    dir.create(paste0(outDir, "/images"))
    dir.create(paste0(outDir, "/masks"))

    across_cnt = ncol(img1)
    down_cnt = nrow(img1)
    tile_size_across = size
    tile_size_down = size
    overlap_across = stride_x
    overlap_down = stride_y
    across <- ceiling(across_cnt/overlap_across)
    down <- ceiling(down_cnt/overlap_down)
    across_add <- (across*overlap_across)-across_cnt
    across_seq <- seq(0, across-1, by=1)
    down_seq <- seq(0, down-1, by=1)
    across_seq2 <- (across_seq*overlap_across)+1
    down_seq2 <- (down_seq*overlap_down)+1

    #Loop through row/column combinations to extract and save chips for the entire image
    for (c in across_seq2){
      for (r in down_seq2){
        c1 <- c
        r1 <- r
        c2 <- c + (size-1)
        r2 <- r + (size-1)
        if(c2 <= across_cnt && r2 <= down_cnt){ #Full chip
          chip_data <- img1[r1:r2, c1:c2, 1:n_channels]
          mask_data <- mask1[r1:r2, c1:c2, 1]
        }else if(c2 > across_cnt && r2 <= down_cnt){ # Last column
          c1b <- across_cnt - (size-1)
          c2b <- across_cnt
          chip_data <- img1[r1:r2, c1b:c2b, 1:n_channels]
          mask_data <- mask1[r1:r2, c1b:c2b, 1]
        }else if(c2 <= across_cnt && r2 > down_cnt){ #Last row
          r1b <- down_cnt - (size-1)
          r2b <- down_cnt
          chip_data <- img1[r1b:r2b, c1:c2, 1:n_channels]
          mask_data <- mask1[r1b:r2b, c1:c2, 1]
        }else{ # Last row, last column
          c1b <- across_cnt - (size -1)
          c2b <- across_cnt
          r1b <- down_cnt - (size -1)
          r2b <- down_cnt
          chip_data <- img1[r1b:r2b, c1b:c2b, 1:n_channels]
          mask_data <- mask1[r1b:r2b, c1b:c2b, 1]
        }
        chip_data2 <- c(stack(chip_data)[,1])
        chip_array <- array(chip_data2, c(size,size,n_channels))
        image1 <- as.cimg(chip_array, x=size, y=size, cc=n_channels)
        imager::save.image(image1, paste0(outDir, "/images/", substr(fName, 1, nchar(image)-4), "_", c1, "_", r1, ".png"))
        names(mask_data) <- c("C")
        Cx <- as.vector(mask_data$C)
        mask_array <- array(Cx, c(size,size,1))
        msk1 <- as.cimg(mask_array, x=size, y=size, cc=1)
        imager::save.image(msk1, paste0(outDir, "/masks/", substr(fName, 1, nchar(image)-4), "_", c1, "_", r1, ".png"))
      }
    }
  }else if(mode == "Positive"){
    img1 <- rast(image)
    mask1 <- rast(mask)

    fName = basename(image)

    dir.create(paste0(outDir, "/images"))
    dir.create(paste0(outDir, "/masks"))

    across_cnt = ncol(img1)
    down_cnt = nrow(img1)
    tile_size_across = size
    tile_size_down = size
    overlap_across = stride_x
    overlap_down = stride_y
    across <- ceiling(across_cnt/overlap_across)
    down <- ceiling(down_cnt/overlap_down)
    across_add <- (across*overlap_across)-across_cnt
    across_seq <- seq(0, across-1, by=1)
    down_seq <- seq(0, down-1, by=1)
    across_seq2 <- (across_seq*overlap_across)+1
    down_seq2 <- (down_seq*overlap_down)+1

    #Loop through row/column combinations to extract and save chips for the entire image
    for (c in across_seq2){
      for (r in down_seq2){
        c1 <- c
        r1 <- r
        c2 <- c + (size-1)
        r2 <- r + (size-1)
        if(c2 <= across_cnt && r2 <= down_cnt){ #Full chip
          chip_data <- img1[r1:r2, c1:c2, 1:n_channels]
          mask_data <- mask1[r1:r2, c1:c2, 1]
        }else if(c2 > across_cnt && r2 <= down_cnt){ # Last column
          c1b <- across_cnt - (size-1)
          c2b <- across_cnt
          chip_data <- img1[r1:r2, c1b:c2b, 1:n_channels]
          mask_data <- mask1[r1:r2, c1b:c2b, 1]
        }else if(c2 <= across_cnt && r2 > down_cnt){ #Last row
          r1b <- down_cnt - (size-1)
          r2b <- down_cnt
          chip_data <- img1[r1b:r2b, c1:c2, 1:n_channels]
          mask_data <- mask1[r1b:r2b, c1:c2, 1]
        }else{ # Last row, last column
          c1b <- across_cnt - (size -1)
          c2b <- across_cnt
          r1b <- down_cnt - (size -1)
          r2b <- down_cnt
          chip_data <- img1[r1b:r2b, c1b:c2b, 1:n_channels]
          mask_data <- mask1[r1b:r2b, c1b:c2b, 1]
        }
        chip_data2 <- c(stack(chip_data)[,1])
        chip_array <- array(chip_data2, c(size,size,n_channels))
        image1 <- as.cimg(chip_array, x=size, y=size, cc=n_channels)
        names(mask_data) <- c("C")
        Cx <- as.vector(mask_data$C)
        mask_array <- array(Cx, c(size,size,1))
        msk1 <- as.cimg(mask_array, x=size, y=size, cc=1)
        if(max(mask_array) > 0){
          imager::save.image(image1, paste0(outDir, "/images/", substr(fName, 1, nchar(image)-4), "_", c1, "_", r1, ".png"))
          imager::save.image(msk1, paste0(outDir, "/masks/", substr(fName, 1, nchar(image)-4), "_", c1, "_", r1, ".png"))
        }
      }
    }
  }else if(mode=="Divided") {
    img1 <- rast(image)
    mask1 <- rast(mask)

    fName = basename(image)

    dir.create(paste0(outDir, "/images"))
    dir.create(paste0(outDir, "/masks"))

    dir.create(paste0(outDir, "/images/positive"))
    dir.create(paste0(outDir, "/images/background"))
    dir.create(paste0(outDir, "/masks/positive"))
    dir.create(paste0(outDir, "/masks/background"))

    across_cnt <- ncol(img1)
    down_cnt <- nrow(img1)
    tile_size_across <- size
    tile_size_down <- size
    overlap_across <- stride_x
    overlap_down <- stride_y
    across <- ceiling(across_cnt/overlap_across)
    down <- ceiling(down_cnt/overlap_down)
    across_add <- (across*overlap_across)-across_cnt
    across_seq <- seq(0, across-1, by=1)
    down_seq <- seq(0, down-1, by=1)
    across_seq2 <- (across_seq*overlap_across)+1
    down_seq2 <- (down_seq*overlap_down)+1

    #Loop through row/column combinations to extract and save chips for the entire image
    for (c in across_seq2){
      for (r in down_seq2){
        c1 <- c
        r1 <- r
        c2 <- c + (size-1)
        r2 <- r + (size-1)
        if(c2 <= across_cnt && r2 <= down_cnt){ #Full chip
          chip_data <- img1[r1:r2, c1:c2, 1:n_channels]
          mask_data <- mask1[r1:r2, c1:c2, 1]
        }else if(c2 > across_cnt && r2 <= down_cnt){ # Last column
          c1b <- across_cnt - (size-1)
          c2b <- across_cnt
          chip_data <- img1[r1:r2, c1b:c2b, 1:n_channels]
          mask_data <- mask1[r1:r2, c1b:c2b, 1]
        }else if(c2 <= across_cnt && r2 > down_cnt){ #Last row
          r1b <- down_cnt - (size-1)
          r2b <- down_cnt
          chip_data <- img1[r1b:r2b, c1:c2, 1:n_channels]
          mask_data <- mask1[r1b:r2b, c1:c2, 1]
        }else{ # Last row, last column
          c1b <- across_cnt - (size -1)
          c2b <- across_cnt
          r1b <- down_cnt - (size -1)
          r2b <- down_cnt
          chip_data <- img1[r1b:r2b, c1b:c2b, 1:n_channels]
          mask_data <- mask1[r1b:r2b, c1b:c2b, 1]
        }
        chip_data2 <- c(stack(chip_data)[,1])
        chip_array <- array(chip_data2, c(size,size,n_channels))
        image1 <- as.cimg(chip_array, x=size, y=size, cc=n_channels)
        names(mask_data) <- c("C")
        Cx <- as.vector(mask_data$C)
        mask_array <- array(Cx, c(size,size,1))
        msk1 <- as.cimg(mask_array, x=size, y=size, cc=1)
        if(max(mask_array) > 0){
          imager::save.image(image1, paste0(outDir, "/images/positive/", substr(fName, 1, nchar(image)-4), "_", c1, "_", r1, ".png"))
          imager::save.image(msk1, paste0(outDir, "/masks/positive/", substr(fName, 1, nchar(image)-4), "_", c1, "_", r1, ".png"))
        }else{
          imager::save.image(image1, paste0(outDir, "/images/background/", substr(fName, 1, nchar(image)-4), "_", c1, "_", r1, ".png"))
          imager::save.image(msk1, paste0(outDir, "/masks/background/", substr(fName, 1, nchar(image)-4), "_", c1, "_", r1, ".png"))
        }
      }
    }
  } else {
    print("Invalid Mode Provided.")
  }
}

In the example, I have generated chips from the topographic maps and associated raster masks that were generated above with the makeMasks() function. I begin by defining strings representing the folder paths to the input images and masks and the output directory in which the chips will be stored. I then list all of the images and masks in the directories. The remaining code in this section is used to randomly split the topographic maps into separate training, validation, and testing sets. This is designed such that all chips from the same topographic map will be assigned to the same data partition. I also make sure that all files representing the same topographic map but different dates are written to the same data partition. Please read through the comments in the code for more details.

imgPth <- "C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\img\\"
mskPth <- "C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\msk\\"
outDir <- "C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\"

#List all topographic maps and associated masks
images <- list.files(imgPth, pattern="\\.tif$")
masks <- list.files(mskPth, pattern="\\.tif$")

#Merge the image and mask file names to a dataframe
input_topos <- data.frame(Images = images, Masks = masks)

#Create new columns that add the file paths to the image and mask file names
input_topos$img_full <- paste0(imgPth, input_topos$Images)
input_topos$msk_full <- paste0(mskPth, input_topos$Masks)

#Loop to extract components of file names to columns
topo_prep <- data.frame()
for(i in 1:nrow(input_topos)) {
  ky_All <- str_split(input_topos[i, 1], "_", simplify=TRUE)
  topo_prep <- rbind(topo_prep, ky_All)
}

#Rename columns
names(topo_prep) <- c("STATE", "NAME", "SCANID", "YEAR", "SCALE", "GEO")

#Merge columns representing the name components back to the original dataframe
input_topos2 <- cbind(input_topos, topo_prep)

#Define the STATE column as a factor
input_topos2$STATE <- as.factor(input_topos2$STATE)
ky_topos <- input_topos2 %>% dplyr::filter(STATE == "KY")

#List all topo names
quad_names <- as.data.frame(levels(as.factor(ky_topos$NAME)))
names(quad_names) <- "NAME"

#Split quads into training, validation, and testing partitions
set.seed(42)
topos_train <- quad_names %>% sample_n(70)
topos_remaining <- setdiff(quad_names, topos_train)
set.seed(43)
topos_val <- topos_remaining %>% sample_frac(.5)
topos_test <- setdiff(topos_remaining, topos_val)

#Assign a unique code to the training, validation, and testing partitions
topos_train$select <- 1
topos_val$select <- 2
topos_test$select <- 3
topos_combined <- rbind(topos_train, topos_val, topos_test)

#Join sampling results back to folder list
ky_topos2 <- left_join(ky_topos, topos_combined, by="NAME")

#Separate into training, validation, and testing splits
train_topos <- ky_topos2 %>% filter(select==1)
val_topos <- ky_topos2 %>% filter(select==2)
test_topos <- ky_topos2 %>% filter(select==3)
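
Because the split is meant to keep every map of a given quadrangle in a single partition, a quick check for leakage can be worthwhile. The sketch below uses a toy table in place of ky_topos2 (the quad names, years, and partition codes are made up) and counts the number of distinct partitions per quad name.

```r
# Toy table standing in for ky_topos2: one row per map
# select: 1 = training, 2 = validation, 3 = testing
toy <- data.frame(
  NAME   = c("Corbin", "Corbin", "Meta", "Kayjay", "Grahn", "Grahn"),
  YEAR   = c(1961, 1978, 1978, 1959, 1962, 1979),
  select = c(1, 1, 1, 2, 3, 3)
)

# Number of distinct partitions in which each quad name appears
parts_per_quad <- tapply(toy$select, toy$NAME, function(x) length(unique(x)))

# TRUE when no quad is spread across more than one partition
no_leakage <- all(parts_per_quad == 1)
print(parts_per_quad)
print(no_leakage)
```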

I then save the data partitions and associated information to CSV files, read the files back in, then use the information in the tables to process all of the data within for loops. When creating image chips using the makeChips() function, each chip has dimensions of 256x256 pixels. Since the strides are the same as the chip size, there will be no overlap between image chips. I am also using the “Divided” method so that the background-only and presence chips are written to separate directories.

write.csv(val_topos, "C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\val_topos.csv")
write.csv(train_topos, "C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\train_topos.csv")
write.csv(test_topos, "C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\test_topos.csv")

val_topos <- read.csv("C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\val_topos.csv")
train_topos <- read.csv("C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\train_topos.csv")
test_topos <- read.csv("C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\test_topos.csv")

for(t in 1:nrow(train_topos)){
  makeChips(image= train_topos[t, c("img_full")],
         mask= train_topos[t, c("msk_full")],
         n_channels=3,
         size=256, stride_x=256, stride_y=256,
         outDir= paste0(outDir, "train"),
         mode="Divided")
}

for(t in 1:nrow(test_topos)){
  makeChips(image= test_topos[t, c("img_full")],
         mask= test_topos[t, c("msk_full")],
         n_channels=3,
         size=256, stride_x=256, stride_y=256,
         outDir= paste0(outDir, "test"),
         mode="Divided")
}

for(t in 1:nrow(val_topos)){
  makeChips(image= val_topos[t, c("img_full")],
         mask= val_topos[t, c("msk_full")],
         n_channels=3,
         size=256, stride_x=256, stride_y=256,
         outDir= paste0(outDir, "val"),
         mode="Divided")
}

Create Chip Table

The makeChipsDF() function is used to generate a dataframe that lists all of the chips and associated masks stored in a directory.

folder: folder path and folder name for the folder containing the image chips. You should include the final “/” in the folder path.
outCSV: file path and file name of the output CSV file with the .csv file extension.
extension: file extension for the image chips and associated masks. The default is .png.
mode: either “All”, “Positive”, or “Divided”. See explanations above for the makeChips() function.
shuffle: whether or not to shuffle the rows in the table. This can be used to randomize the chips. However, you can always shuffle the rows later. FALSE indicates to not shuffle the rows while TRUE indicates to shuffle the rows. The default is FALSE.
saveCSV: if TRUE, save a CSV file to disk. If FALSE, only creates a dataframe and does not save the table out to disk. The default is FALSE. If FALSE, the outCSV parameter is ignored.

The resulting table will contain the following columns:

chp = name of image chip
chpPth = full path and file name of image chip
mskPth = full path and file name of associated raster mask
If mode = “Divided”, the first two columns are instead named chpN and chpPath, and an additional “division” column is included. “Positive” indicates that the chip contains at least one pixel from the positive class, while “Background” means that the chip contains only background pixels. This information can be used to filter, subset, or sample chips based on whether or not they contain only background pixels.

makeChipsDF <- function(folder, outCSV, extension=".png", mode="All", shuffle=FALSE, saveCSV=FALSE){
  require(dplyr)
  if(mode == "All" || mode == "Positive"){
    #List chips and build matching image and mask paths
    lstChps <- list.files(paste0(folder, "images/"), pattern=paste0("\\", extension, "$"))
    lstChpsPth <- paste0(folder, "images/", lstChps)
    lstMsksPth <- paste0(folder, "masks/", lstChps)
    chpDF <- data.frame(chp=lstChps, chpPth=lstChpsPth, mskPth=lstMsksPth)
  }else{
    #List background-only and positive chips separately
    lstChpsB <- list.files(paste0(folder, "images/background/"), pattern=paste0("\\", extension, "$"))
    lstChpsP <- list.files(paste0(folder, "images/positive/"), pattern=paste0("\\", extension, "$"))
    lstChpsPthB <- paste0(folder, "images/background/", lstChpsB)
    lstMsksPthB <- paste0(folder, "masks/background/", lstChpsB)
    lstChpsPthP <- paste0(folder, "images/positive/", lstChpsP)
    lstMsksPthP <- paste0(folder, "masks/positive/", lstChpsP)
    chpDFB <- data.frame(chpN=lstChpsB, chpPath=lstChpsPthB, mskPth=lstMsksPthB)
    chpDFP <- data.frame(chpN=lstChpsP, chpPath=lstChpsPthP, mskPth=lstMsksPthP)
    chpDFP$division <- "Positive"
    chpDFB$division <- "Background"
    chpDF <- bind_rows(chpDFB, chpDFP)
  }
  if(shuffle == TRUE){
    chpDF <- chpDF %>% sample_n(nrow(chpDF), replace=FALSE)
  }
  if(saveCSV == TRUE){
    write.csv(chpDF, outCSV)
  }
  return(chpDF)
}

Below, I have created the dataframes for the training, testing, and validation data. I indicate that the file extension is “.png” and the mode is “Divided”. CSV files are written to disk and the rows are shuffled. Lastly, I print the first 6 rows of the training dataframe as a check.

trainDF <- makeChipsDF("C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\", 
            "C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\trainDF.csv", 
            ".png", mode="Divided", shuffle=TRUE, saveCSV=TRUE)

testDF <- makeChipsDF("C:/myFiles/work/topo_dl_data/topo_dl_data/processing/chips/test/", 
            "C:/myFiles/work/topo_dl_data/topo_dl_data/processing/testDF.csv", 
            ".png", mode="Divided", shuffle=TRUE, saveCSV=TRUE)

valDF <- makeChipsDF("C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\val\\", 
            "C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\valDF.csv", 
            ".png", mode="Divided", shuffle=TRUE, saveCSV=TRUE)

head(trainDF)

                                                 chpN
1    KY_Corbin_708433_1961_24000_geo.tif_2561_769.png
2   KY_Kayjay_709002_1959_24000_geo.tif_5377_6145.png
3     KY_Meta_709277_1978_24000_geo.tif_2817_4609.png
4 KY_Barthell_803308_1954_24000_geo.tif_3329_5121.png
5   KY_Thomas_709858_1954_24000_geo.tif_4097_3073.png
6    KY_Grahn_708756_1962_24000_geo.tif_3841_3329.png
                                                                                                                                         chpPath
1    C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\images/background/KY_Corbin_708433_1961_24000_geo.tif_2561_769.png
2   C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\images/background/KY_Kayjay_709002_1959_24000_geo.tif_5377_6145.png
3       C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\images/positive/KY_Meta_709277_1978_24000_geo.tif_2817_4609.png
4 C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\images/background/KY_Barthell_803308_1954_24000_geo.tif_3329_5121.png
5   C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\images/background/KY_Thomas_709858_1954_24000_geo.tif_4097_3073.png
6    C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\images/background/KY_Grahn_708756_1962_24000_geo.tif_3841_3329.png
                                                                                                                                         mskPth
1    C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\masks/background/KY_Corbin_708433_1961_24000_geo.tif_2561_769.png
2   C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\masks/background/KY_Kayjay_709002_1959_24000_geo.tif_5377_6145.png
3       C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\masks/positive/KY_Meta_709277_1978_24000_geo.tif_2817_4609.png
4 C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\masks/background/KY_Barthell_803308_1954_24000_geo.tif_3329_5121.png
5   C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\masks/background/KY_Thomas_709858_1954_24000_geo.tif_4097_3073.png
6    C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\chips\\train\\masks/background/KY_Grahn_708756_1962_24000_geo.tif_3841_3329.png
    division
1 Background
2 Background
3   Positive
4 Background
5 Background
6 Background
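
As one example of using the division column, the sketch below keeps all positive chips and draws an equal number of background-only chips at random. This is only one possible sampling scheme, not necessarily the one used in the study, and the toy table stands in for a table produced by makeChipsDF(). The label strings used for filtering should match whatever appears in your own division column.

```r
# Toy chip table standing in for the output of makeChipsDF(mode="Divided")
set.seed(1)
chips <- data.frame(
  chpN = paste0("chip_", 1:20, ".png"),
  division = c(rep("Positive", 5), rep("Background", 15))
)

# Separate the positive and background-only chips
pos <- chips[chips$division == "Positive", ]
bck <- chips[chips$division == "Background", ]

# Keep all positive chips and an equal number of random background chips
n_keep <- min(nrow(pos), nrow(bck))
bck_sub <- bck[sample(nrow(bck), n_keep), ]

# Combine and shuffle so that the two groups are interleaved
balanced <- rbind(pos, bck_sub)
balanced <- balanced[sample(nrow(balanced)), ]
print(table(balanced$division))
```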

Describe Chips

The describeChips() function can be used to obtain summary statistics and info about the image chips and masks. This function will return a dataframe that provides the mean, standard deviation, minimum, and maximum values calculated from all of the pixel values or just a random subset of pixels. For the mask, it provides the minimum and maximum codes and the count of pixels assigned to each code. If a large number of chips with many pixels are included, the calculation of statistics can become computationally intensive. So, it is possible to calculate statistics using a random subsample of pixels from each chip and associated mask. You can also use a subset of chips as opposed to all chips. When mode = “Divided” statistics will be calculated for the positive and background-only samples separately and also combined.

The information provided by this function can be useful for specifying data normalization parameters and/or determining class weightings to combat class imbalance issues.
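
As a rough sketch of how such summaries might be used, the code below standardizes band values with per-band means and standard deviations and derives inverse-frequency class weights from per-class pixel counts. All numbers are invented for illustration, and inverse-frequency weighting is only one common choice; it is an assumption here, not a method prescribed by describeChips().

```r
# Hypothetical per-band statistics of the kind describeChips() reports
band_mean <- c(B1 = 0.62, B2 = 0.58, B3 = 0.49)
band_sd   <- c(B1 = 0.21, B2 = 0.20, B3 = 0.23)

# Hypothetical pixel counts per mask code (0 = background, 1 = feature)
pixel_counts <- c("0" = 900000, "1" = 100000)

# Standardize the band values of one pixel: (value - mean) / sd
pixel <- c(B1 = 0.83, B2 = 0.58, B3 = 0.26)
pixel_norm <- (pixel - band_mean) / band_sd

# Inverse-frequency class weights, rescaled so that they average to 1
class_freq <- pixel_counts / sum(pixel_counts)
class_weights <- (1 / class_freq) / mean(1 / class_freq)

print(round(pixel_norm, 2))
print(round(class_weights, 3))
```

The rarer class receives the larger weight, which is the intended effect when the positive class covers only a small share of the pixels.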

folder: folder in which the image chips and associated masks are stored.
extension: file extension for images and masks. Default is .png.
mode: either “All”, “Positive”, or “Divided”. See explanations above associated with the makeChips() function. Default is “All”.
subSample: whether or not to calculate statistics using a subsample of chips. If TRUE, a subsample is used. If FALSE, a subsample is not used. We encourage using subsampling if there are a large number of chips and masks, as calculating statistics using all chips and masks can be computationally intensive.
numChips: number of chips to use if subSample is TRUE. Ignored if subSample is FALSE. Default is 200.
numChipsBack: number of background-only chips to use if subSample is TRUE and mode is “Divided”. Ignored if subSample is FALSE or mode is not “Divided”. Default is 200.
subSamplePix: whether or not to calculate statistics using a subsample of pixels from included chips. If TRUE, a subsample is used. If FALSE, a subsample is not used. We encourage using subsampling if there are a large number of chips and masks, as calculating statistics using all chips and masks and associated pixels can be computationally intensive.
sampsPerChip: If subSamplePix is TRUE, indicates how many pixels to sample per chip. If subSamplePix is FALSE, setting is ignored. The default is 100.

library(terra)
library(dplyr)

describeChips <- function(folder, extension=".png", mode="All", subSample=FALSE, numChips=200, numChipsBack=200, subSamplePix=FALSE, sampsPerChip=100){
  # Build the file name pattern; the extension may be given with or without the leading dot
  pattern <- paste0("\\.", sub("^\\.", "", extension), "$")

  # Subfolders to read and the number of chips to sample from each
  if(mode == "All" || mode == "Positive"){
    subDirs <- c("")
    nSamps <- c(numChips)
  }else{
    subDirs <- c("background/", "positive/")
    nSamps <- c(numChipsBack, numChips)
  }

  chipDF <- data.frame()
  mskStats <- data.frame()
  for(i in seq_along(subDirs)){
    imgDir <- paste0(folder, "images/", subDirs[i])
    mskDir <- paste0(folder, "masks/", subDirs[i])
    lstChips <- list.files(imgDir, pattern=pattern)
    lstMsk <- list.files(mskDir, pattern=pattern)

    # Optionally keep a random subset of chips; the same indices are used
    # for the masks so that each chip stays paired with its mask
    if(subSample == TRUE){
      samps <- sample(seq_along(lstChips), min(nSamps[i], length(lstChips)))
      lstChips <- lstChips[samps]
      lstMsk <- lstMsk[samps]
    }

    # Collect pixel values from each chip, one column per band
    for(chip in lstChips){
      chipInDF <- as.data.frame(rast(paste0(imgDir, chip)))
      if(subSamplePix == TRUE){
        chipInDF <- chipInDF %>% slice_sample(n=sampsPerChip)
      }
      names(chipInDF) <- paste0("B", seq_len(ncol(chipInDF)))
      chipDF <- bind_rows(chipDF, chipInDF)
    }

    # Collect the pixel count per class code from each mask (all mask pixels are used)
    for(msk in lstMsk){
      mskInDF <- freq(rast(paste0(mskDir, msk)))
      mskStats <- bind_rows(mskStats, mskInDF)
    }
  }

  # Per-band summary of the image values and total pixel count per class code
  imgStats <- summary(chipDF)
  mskStats2 <- mskStats %>% group_by(value) %>% summarize(cnt = sum(count))
  outStats <- list(ImageStats=imgStats, mskStats=mskStats2)
  return(outStats)
}

In this example, I am calculating statistics for the training chips. I am using a subsample of 1,000 positive chips and 250 background-only chips. I am also using a subsample of 100 pixels per chip. Since the function returns a list object containing two tables (the image statistics and the mask counts), I then write them out to disk separately.

trainStats <- describeChips("C:/myFiles/work/topo_dl_data/topo_dl_data/processing/chips/train/",
                            ".png", mode="Divided", subSample=TRUE, numChips = 1000, numChipsBack = 250,
                            subSamplePix=TRUE, sampsPerChip=100)

write.csv(trainStats$ImageStats, "C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\trainStatsImg.csv")
write.csv(trainStats$mskStats, "C:\\myFiles\\work\\topo_dl_data\\topo_dl_data\\processing\\trainStatsMsk.csv")
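If the goal is to derive normalization parameters, per-band means and standard deviations are typically what is needed. The following is a minimal sketch of that calculation and of the resulting standardization. The data frame pixDF holds synthetic values (an assumption for illustration) standing in for pixel values sampled from chips; it is not produced by describeChips().

```r
# Synthetic pixel sample with three bands, standing in for sampled chip values
set.seed(1)
pixDF <- data.frame(B1 = rnorm(5000, 120, 30),
                    B2 = rnorm(5000, 110, 25),
                    B3 = rnorm(5000, 100, 20))

# Per-band mean and standard deviation
bandMeans <- sapply(pixDF, mean)
bandSDs <- sapply(pixDF, sd)

# Standardize each band: subtract its mean and divide by its standard deviation
pixNorm <- as.data.frame(scale(pixDF, center = bandMeans, scale = bandSDs))

# After standardization each band has a mean near 0 and a standard deviation near 1
round(sapply(pixNorm, mean), 3)
round(sapply(pixNorm, sd), 3)
```

In practice the means and standard deviations would be estimated from the training chips only and then reused unchanged for the validation and testing chips.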

Concluding Remarks

You can now generate raster masks, image chips, lists of image chips, and descriptive statistics using our R functions. In the next section, we will use the data generated in this module to train a semantic segmentation model using the Segmentation Models package.