jgenvironment/ClustMod

ClustMod

Version 4.0
Author: Joschka Geissler
Last modified: 25 January 2025

Background

This documentation introduces the steps required to apply the ClustMod model. ClustMod predicts snow distribution patterns from Canopy Height Models (CHM) and Digital Terrain Models (DTM). It is implemented in R (Version 4.1.0) and requires the following libraries to be installed:

Table 1: R-packages required for running the ClustMod model

| Name | Version | Literature |
|---|---|---|
| rstudioapi | 0.13 | Ushey et al. (2014) |
| raster | 3.4-13 | Hijmans (2021) |
| gstat | 2.0-8 | Pebesma and Graeler (2003) |
| tidyverse | 1.3.2 | Wickham et al. (2019) |
| sf | 1.0-9 | Pebesma (2018) |
| whitebox | 2.3.0 | Wu and Brown (2022) and Lindsay (2016) |
| assertthat | 0.2.1 | Wickham (2013) |
| ForestTools | 0.2.5 | Plowright (2017) |
| tmap | 3.3-4 | Tennekes (2018) |
| R.utils | 2.12.2 | Bengtsson (2005) |
| reshape2 | 1.4.4 | Gräler et al. (2016) |
| windninjr | 0.0.3 | Raymond (2020) |
| caret | 6.0-88 | Kuhn (2008) |
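
Whether the packages of Table 1 are installed can be checked before running the model. The following is a minimal sketch and not part of the ClustMod code: the helper name find_missing_packages is hypothetical, and windninjr is left out on the assumption that its functions are loaded from the separate script described in the Workflow section.

```r
# Packages listed in Table 1 (windninjr omitted, see the note above)
required_packages <- c("rstudioapi", "raster", "gstat", "tidyverse", "sf",
                       "whitebox", "assertthat", "ForestTools", "tmap",
                       "R.utils", "reshape2", "caret")

# Hypothetical helper: returns the names of packages that are not installed
find_missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

missing_pkgs <- find_missing_packages(required_packages)
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

This only checks that a package is present. The installed version can be compared with Table 1 using packageVersion().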

Snow distribution patterns are based on the ClustSnow workflow, which performs an unsupervised classification (clustering using k-means and random forest algorithms) of a stack of multitemporal snow depth (HS) maps. ClustSnow was first introduced by Geissler et al. (2023) and Geissler, Mazzotti, et al. (2024) and is available for download from Geissler and Weiler (2024).

Workflow

The ClustMod model first determines all model features for each study site using the PrepareData function. This function applies several third-party models and algorithms:

  • The WindNinja (Raymond, 2020) model is applied to spatialize values of wind speed and direction across the study domains. WindNinja must be installed, and the path to its executable must be specified in the configuration files. Note that the required functions from the windninjr package are provided as an independent script, because an update to the windninjr package led to problems when executing ClustMod.
  • The DCE algorithm, first presented by Mazzotti et al. (2019), is applied to determine forest structure metrics based on the distance to the canopy edges (DCE).
  • A workflow presented by Lindsay (2016) is applied to derive the Topographic Wetness Index.

For all study sites where the isTrain parameter of the config_data_preparation configuration file is TRUE, clusters are derived from the available HS maps using the getCluster function. Subsequently, one random forest model is trained for each of the four (n_class) clusters, using the features (listed in features) as independent variables and the cluster probabilities as the dependent variable, for all sites listed in train_sites. Note that this version of ClustMod only supports four clusters, and therefore four random forest models (n_class=4).

The available code further allows for an evaluation of the ClustMod model. To this end, the Overall Accuracy (OA) is derived for a small, randomly selected subset of the training data set (its share is given by 1 - train_test_split). For each grid cell in this subset, the winner cluster (the cluster with the highest probability) predicted by the model is compared with the winner cluster derived from the observations using the getCluster function.
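
The comparison of winner clusters can be sketched in a few lines of base R. This is an illustration only and not the ClustMod implementation: the helper names winner_cluster and overall_accuracy are hypothetical, the probabilities are assumed to be stored as matrices with one row per grid cell and one column per cluster, and the numbers are made up.

```r
# Hypothetical helper: winner cluster = column with the highest probability
winner_cluster <- function(prob) {
  max.col(prob, ties.method = "first")
}

# Hypothetical helper: share of grid cells with identical winner clusters
overall_accuracy <- function(prob_pred, prob_obs) {
  stopifnot(all(dim(prob_pred) == dim(prob_obs)))
  mean(winner_cluster(prob_pred) == winner_cluster(prob_obs))
}

# Toy example: 4 grid cells (rows), 4 clusters (columns); values are made up
prob_obs <- rbind(c(0.7, 0.1, 0.1, 0.1),
                  c(0.1, 0.6, 0.2, 0.1),
                  c(0.2, 0.2, 0.5, 0.1),
                  c(0.1, 0.1, 0.2, 0.6))
prob_pred <- rbind(c(0.6, 0.2, 0.1, 0.1),
                   c(0.5, 0.3, 0.1, 0.1),
                   c(0.1, 0.2, 0.6, 0.1),
                   c(0.1, 0.2, 0.1, 0.6))

# Winner clusters agree in cells 1, 3 and 4, so the OA is 0.75
oa <- overall_accuracy(prob_pred, prob_obs)
print(oa)
```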

Subsequently, the uploaded code allows for a prediction of clusters for the sites that are listed in the configuration file (file config_ClustMod; variable test_sites).

Predicted clusters can serve as a basis for extrapolating point measurements or model simulations. For more information we refer to Geissler (2025).

Setting Up the Folder and Data Structure

Running ClustMod involves three steps: i) setting up the right folder and data structure, ii) creating the individual configuration files, and iii) running the R scripts. Each of these steps is explained in the following.

This documentation presents version 4.0 of ClustMod, as presented in Geissler (2025). Running this specific version requires the external data sets listed in Table 2. From these data sets, the CHM, the DTM and the available HS maps are required, together with a coarser-level DTM containing each of the study sites. DTM, CHM and HS maps must be resampled to 1 m spatial resolution. Large-scale DTMs are available worldwide, among others, from the Shuttle Radar Topography Mission (SRTM).

Table 2: Data sets required for reproducing ClustMod as presented by Geissler (2025)

| Study Site | Country | Last Year of Data Acquisition | Citation |
|---|---|---|---|
| Alptal | Switzerland | 2023 | Geissler, Rathmann, and Weiler (2024) |
| Schauinsland | Germany | 2023 | Geissler, Rathmann, and Weiler (2024) |
| Fluela_North | Switzerland | 2020 | Koutantou et al. (2022) |
| Fluela_South | Switzerland | 2020 | Koutantou et al. (2022) |
| Fluela | Switzerland | 2017 | Mazzotti et al. (2023) and Mazzotti and Jonas (2022) |

For a successful application of ClustMod, a specific folder structure is required. The working directory must contain the ClustMod script (ClustMod_v4.R) as well as ClustMod’s and ClustSnow’s functions (ClustMod_Functions_v4.R, ClustSnow_Functions_v4.R and windninjr.R). Data sets must be stored in individual folders located within the Data_ClustMod folder. Besides these data sets, the Data_ClustMod folder contains the configuration file config_ClustMod.txt.

For each data set, the CHM and DTM must be placed in the respective folder. If the site is to be used as a training site, a stack with all HS observations is additionally required. The file paths to these raster data must be specified in the config_data_preparation configuration file that is located within each individual data folder. Figure 1 illustrates this data and folder structure for the ClustMod model as presented by Geissler (2025).


Figure 1: Folder Structure for setting up the ClustMod Model
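
The folder structure described above can be verified with a short check before running the model. The following is a sketch under stated assumptions, not part of ClustMod: the helper name check_structure is hypothetical, Data_ClustMod is assumed to sit inside the working directory, and the file name config_data_preparation.txt is assumed, since this documentation does not state its extension.

```r
# Hypothetical helper: returns the expected files that are missing.
# Script names follow the text above; "config_data_preparation.txt" is an
# assumed file name (the extension is not stated in this documentation).
check_structure <- function(wd, sites) {
  expected <- c(
    file.path(wd, c("ClustMod_v4.R", "ClustMod_Functions_v4.R",
                    "ClustSnow_Functions_v4.R", "windninjr.R")),
    file.path(wd, "Data_ClustMod", "config_ClustMod.txt"),
    file.path(wd, "Data_ClustMod", sites, "config_data_preparation.txt")
  )
  expected[!file.exists(expected)]
}

# Example call for the current working directory and the sites of Table 2
sites <- c("Alptal", "Schauinsland", "Fluela_North", "Fluela_South", "Fluela")
missing_paths <- check_structure(getwd(), sites)
print(missing_paths)
```

An empty result means that all expected files were found. The raster files themselves are not checked here, because their paths are set individually in each config_data_preparation file.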

Configuration file “config_ClustMod”

The configuration file config_ClustMod must be stored in the Data_ClustMod folder. This file contains all parameters required to define the ClustMod model. Each line in config_ClustMod must be written in valid R syntax to ensure it can be interpreted by R.

Key parameters in the configuration file include:

  • path_in: Specifies the directory path where the input data is stored.
  • path_out: Defines the directory path where the model output will be written.
  • experiment and model: Specify the name of the individual model experiment.
  • train_sites: A vector containing the names of the study sites to be used for training the ClustMod model. The names must exactly match the corresponding folder names in Data_ClustMod.
  • test_sites: A vector containing the names of the study sites for which clusters should be predicted. These names must also match their respective folder names.
  • features: A vector listing all the independent variables to be used during training of the ClustMod model.
  • n_class: The number of clusters, and thus of random forest models, that the ClustMod model is built upon (only the value 4 is possible in the published version of ClustMod).
  • train_test_split: Defines the share of data points of the training data set that is used for training the ClustMod model. The remaining share (1 - train_test_split) is used for the subsequent determination of the OA (testing).
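
The effect of train_test_split can be illustrated with a random split of data point indices. This is a minimal sketch and not the ClustMod implementation: the seed and the number of data points are made up, and the actual sampling procedure in ClustMod may differ.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible

train_test_split <- 0.8   # value from the example configuration
n_cells <- 1000           # made-up number of data points

# Randomly assign the indices of the data points to training and testing
train_idx <- sample(n_cells, size = round(train_test_split * n_cells))
test_idx  <- setdiff(seq_len(n_cells), train_idx)

length(train_idx)  # 800 data points for training
length(test_idx)   # 200 data points for determining the OA
```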

The parameters sample_length, kmeans_maxiter, kmeans_nstart, mtry and n_trees belong to the ClustSnow workflow. More details, including a sensitivity analysis and calibration results, can be found in Geissler, Mazzotti, et al. (2024) and Geissler et al. (2023).

The following lines show the config_ClustMod configuration file for the example presented by Geissler (2025).

# ClustMod Parameters
path_in                   = "Data_ClustMod"
path_out                  = "Output"
experiment                = "20250101"
model                     = "v0"
train_sites               = c('Alptal','Schauinsland','Fluela_North','Fluela_South')
test_sites                = c('Fluela')
features                  = c('CC_5', 'CC_50', 'CHM_1', 'CHM_5', 'CHM_50', 'CHMmed_5', 'DCE_1', 'DIST_1','LWDCE_1', 'NDCE_1', 'NN_CHM_5', 'NN_CHM_20', 'NN_CHM_50', 'NN_DTM_5', 'NN_DTM_20','NN_DTM_50','SDCE_1', 'TPI_1', 'TPI_5', 'TPI_90', 'TWI_1', 'TWI_5','WFDCE_1','WNDIR_1','WNVEL_1')
train_test_split          = 0.8

# ClustSnow Parameters
sample_length             = 1600
n_class                   = 4
kmeans_maxiter            = 16
kmeans_nstart             = 39
mtry                      = 4
n_trees                   = 451
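
Because every line of the configuration file is valid R syntax, such a file can be evaluated directly by R. The following sketch shows one possible way to load it into a named list. It is an assumption about how a file of this kind can be read, not necessarily how ClustMod reads it, and it uses a shortened, temporary copy of the example configuration.

```r
# Write a shortened example configuration to a temporary file
config_path <- tempfile(fileext = ".txt")
writeLines(c(
  "# ClustMod Parameters",
  "path_in          = \"Data_ClustMod\"",
  "train_sites      = c('Alptal','Schauinsland')",
  "train_test_split = 0.8",
  "n_class          = 4"
), config_path)

# Evaluate every line as R code inside a dedicated environment,
# so that the parameters do not clutter the global workspace
config_env <- new.env()
source(config_path, local = config_env)
config <- as.list(config_env)

str(config[order(names(config))])
```

Evaluating the file in its own environment keeps the parameters together and makes it easy to check that all expected names are present, for example with setdiff().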

Configuration file “config_data_preparation”

As for config_ClustMod, each line of the config_data_preparation configuration file must be written in valid R syntax. Each data set must contain such a configuration file. It contains all relevant data paths and information needed for the individual study sites.

  • finalize: Lets the user select whether (TRUE) or not (FALSE) the PrepareData function is applied to this study site, and thus whether features are (re)calculated for it. Setting this parameter to FALSE avoids recalculating features after small adjustments have been made to the model.
  • dtm_path, dtm_largescale_path and chm_path: Define the absolute or relative (to the folder of the data set) paths of the DTM, the larger-scale DTM (e.g., SRTM) and the CHM.
  • crs_dtm, crs_dtm_largescale and crs_chm: Define the coordinate systems of provided DTMs and CHMs using EPSG codes.
  • dir_out: Defines the folder where data and logfiles should be written to.
  • buffer_size: Defines the buffer size, in meters, by which the SRTM data should extend beyond the provided DTM. This buffer is necessary for calculating large-scale topographic parameters, such as the topographic position index (TPI), or for the application of the WindNinja model.
  • isTrain: This parameter defines whether or not clusters should be calculated from existing HS data. Note that for all data sets listed as train_sites, this parameter must be TRUE. For all data sets listed in test_sites, this parameter should be FALSE.
  • snow_depth_path: This parameter must indicate the path to the HS data. This is required when isTrain is TRUE or when the application will be performed. Otherwise, this parameter can be left out.
  • crs_snowdata: This parameter must indicate the coordinate system of the HS data using EPSG codes. This is required when isTrain is TRUE or when the application will be performed. Otherwise, this parameter can be left out.
  • wind_direction: The dominant wind direction during winter months within the domain of the data set in degrees. In Geissler (2025), this value was obtained from the ERA5 meteorological reanalysis model (Hersbach et al., 2020).
  • wind_velocity: The average wind speed during winter months within the domain of the data set in meters per second, measured at 10 m above ground. In Geissler (2025), this value was obtained from the ERA5 meteorological reanalysis model (Hersbach et al., 2020).
  • path_to_wn_exe: For running the WindNinja model, the path to the WindNinja executable must be specified.

The following lines show the config_data_preparation configuration file for the example presented by Geissler (2025) and the Alptal data set.

# ClustMod Parameters
dtm_path                  = ".\\RAW\\dtm_1.asc"
dtm_largescale_path       = ".\\SRTM.asc"
chm_path                  = ".\\RAW\\chm_1.asc"
crs_dtm                   = "EPSG:32632"
crs_chm                   = "EPSG:32632"
crs_dtm_largescale        = "EPSG:4326"

finalize                  = FALSE
buffer_size               = 90
DCE_step_nr               = 300

isTrain                   = TRUE
snow_depth_path           = ".\\snow-depth"
crs_snowdata              = "EPSG:2056"

# WindNinja Parameters
wind_direction            = 120 
wind_velocity             = 1.3
path_to_wn_exe            = "C:\\WindNinja\\WindNinja-3.6.0\\bin\\WindNinja_cli.exe"
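
The rules stated for isTrain, snow_depth_path and crs_snowdata can be checked automatically once a configuration has been loaded into a list. The following is a sketch only: the helper name validate_site_config is hypothetical and not part of ClustMod, and the expected form "EPSG:<number>" is an assumption taken from the example above.

```r
# Hypothetical helper: returns a character vector of problems (empty if none)
validate_site_config <- function(cfg, site, train_sites) {
  problems <- character(0)
  # Sites used for training must have isTrain = TRUE
  if (site %in% train_sites && !isTRUE(cfg$isTrain)) {
    problems <- c(problems, "isTrain must be TRUE for sites listed in train_sites")
  }
  # HS data path and coordinate system are required when isTrain is TRUE
  if (isTRUE(cfg$isTrain)) {
    for (p in c("snow_depth_path", "crs_snowdata")) {
      if (is.null(cfg[[p]])) {
        problems <- c(problems, paste(p, "is required when isTrain is TRUE"))
      }
    }
  }
  # Assumed format of the coordinate systems: "EPSG:<number>"
  for (p in c("crs_dtm", "crs_chm", "crs_dtm_largescale")) {
    if (!is.null(cfg[[p]]) && !grepl("^EPSG:[0-9]+$", cfg[[p]])) {
      problems <- c(problems, paste(p, "should look like 'EPSG:32632'"))
    }
  }
  problems
}

# Values taken from the Alptal example above
alptal <- list(isTrain = TRUE, snow_depth_path = ".\\snow-depth",
               crs_snowdata = "EPSG:2056", crs_dtm = "EPSG:32632",
               crs_chm = "EPSG:32632", crs_dtm_largescale = "EPSG:4326")
problems <- validate_site_config(alptal, "Alptal",
                                 train_sites = c("Alptal", "Schauinsland"))
print(problems)
```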

Data Output

All output of ClustMod is stored in the defined output path (path_out).

For all sites listed in test_sites, the clusters are predicted using the trained ClustMod model. The application of ClustMod produces logfiles to store essential information regarding the model execution. These logfiles include the following details:

  • Model parameters: Key settings and configurations used during the model's operation.
  • Runtime information: Details about the execution process, including timestamps and processing durations.
  • Messages, warnings, and errors: Any messages generated by R during execution, including warnings and errors, providing insights for debugging and ensuring reproducibility.

References