Training Strategy

This section describes the training configuration, data preparation pipeline, and optimization strategies for RTnn.

Data Preparation

RTnn processes NetCDF files containing land surface model (LSM) data. File names follow the convention: rtnetcdf_{processor_rank:03d}_{year}.nc
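As an illustration of the naming convention, the sketch below builds the file name for a given rank and year. The helper name `rtnetcdf_filename`, the zero-based rank numbering, and the year 2000 are assumptions made for the example and are not taken from RTnn itself.

```python
def rtnetcdf_filename(processor_rank: int, year: int) -> str:
    """Build a file name following rtnetcdf_{processor_rank:03d}_{year}.nc."""
    return f"rtnetcdf_{processor_rank:03d}_{year}.nc"


# File names for the first three ranks of a hypothetical run (year assumed).
names = [rtnetcdf_filename(rank, 2000) for rank in range(3)]
print(names)
```

The `:03d` format pads the rank with leading zeros to three digits, so files sort in rank order when listed alphabetically.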

_images/canopy.png — Structure of the vegetation canopy

Vertical Canopy Structure:

The model simulates radiative transfer through multiple vertical layers (default: 10 layers), representing different heights within the vegetation canopy. Each layer has its own optical properties (leaf area index, single scattering albedo, etc.) that influence radiation propagation.

Processor Rank Distribution:

Data is distributed across multiple processor ranks (for example, 16). During training, a random 60% subset of the ranks is selected in each epoch to reduce bias and improve generalization.

_images/random_rank.png — Example of randomly selected MPI blocks used during training

Random Spatial Mapping:

During training, a new random spatial mapping is generated for each time step
60% of processor ranks are randomly selected (data augmentation)
The same mapping applies to all spatial batches within a time step
Validation/testing uses 100% of processor ranks (deterministic)
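The selection rule above can be sketched as follows. This is a minimal illustration, not RTnn's implementation: the function name `select_ranks`, the rounding of the 60% fraction to a whole number of ranks, and the `seed` argument are all assumptions.

```python
import random
from typing import List, Optional


def select_ranks(num_ranks: int, fraction: float = 0.6,
                 training: bool = True,
                 seed: Optional[int] = None) -> List[int]:
    """Return the processor ranks to use for one time step."""
    all_ranks = list(range(num_ranks))
    if not training:
        # Validation/testing: use every rank, in a fixed order.
        return all_ranks
    rng = random.Random(seed)
    # Assumed rounding rule: nearest whole number of ranks, at least one.
    k = max(1, round(fraction * num_ranks))
    # Sample without replacement so no rank is picked twice.
    return sorted(rng.sample(all_ranks, k))


train_ranks = select_ranks(16, seed=0)        # 10 of 16 ranks
val_ranks = select_ranks(16, training=False)  # all 16 ranks
print(len(train_ranks), len(val_ranks))
```

Reusing the returned list for every spatial batch of a time step gives the "same mapping within a time step" behavior described above.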

Input Features (121 channels):

The input features are constructed from four variable groups, flattened across Plant Functional Types (PFTs, 15 types) and spectral bands (2 bands: VIS and NIR):

Input Channel Composition
Variable Group	Channels	Description
coszang	1	Cosine of solar zenith angle (direct beam direction)
laieff_collim, laieff_isotrop	30 (2 × 15 PFTs)	Leaf area index for collimated and isotropic radiation
leaf_ssa, leaf_psd	60 (2 × 2 bands × 15 PFTs)	Single scattering albedo and phase function asymmetry
rs_surface_emu	30 (1 × 2 bands × 15 PFTs)	Surface reflectance (soil albedo)
Total	121
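The channel count in the table can be checked with a few lines of arithmetic. The dictionary below only restates the table; the constant names `N_PFT` and `N_BAND` are chosen for this example.

```python
N_PFT = 15   # plant functional types
N_BAND = 2   # spectral bands: VIS and NIR

# Channels contributed by each variable group, as listed in the table.
input_groups = {
    "coszang": 1,
    "laieff_collim, laieff_isotrop": 2 * N_PFT,
    "leaf_ssa, leaf_psd": 2 * N_BAND * N_PFT,
    "rs_surface_emu": 1 * N_BAND * N_PFT,
}

total_inputs = sum(input_groups.values())
print(total_inputs)  # 121
```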

Output Targets (120 channels):

The model predicts four output variables for each PFT and band combination:

Output Channel Composition
Variable Group	Channels	Description
collim_alb	30 (15 PFTs × 2 bands)	Collimated albedo (reflected direct radiation)
collim_tran	30 (15 PFTs × 2 bands)	Collimated transmittance (transmitted direct radiation)
isotrop_alb	30 (15 PFTs × 2 bands)	Isotropic albedo (reflected diffuse radiation)
isotrop_tran	30 (15 PFTs × 2 bands)	Isotropic transmittance (transmitted diffuse radiation)
Total	120
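To make the output layout concrete, the sketch below splits a flat vector of 120 channels into the four groups of 30. It assumes the groups are concatenated in the order of the table, which the documentation does not state; `split_outputs` is a hypothetical helper.

```python
N_PFT = 15
N_BAND = 2
GROUP_SIZE = N_PFT * N_BAND  # 30 channels per output variable

# Assumed channel order: same as the rows of the table above.
OUTPUT_GROUPS = ["collim_alb", "collim_tran", "isotrop_alb", "isotrop_tran"]


def split_outputs(channels):
    """Split a flat sequence of 120 channel values into named groups of 30."""
    expected = GROUP_SIZE * len(OUTPUT_GROUPS)
    if len(channels) != expected:
        raise ValueError(f"expected {expected} channels, got {len(channels)}")
    return {
        name: channels[i * GROUP_SIZE:(i + 1) * GROUP_SIZE]
        for i, name in enumerate(OUTPUT_GROUPS)
    }


groups = split_outputs(list(range(120)))
print({name: len(values) for name, values in groups.items()})
```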

Physical Constraint:

For each layer, the outputs satisfy energy conservation:

\[\text{albedo} + \text{transmittance} + \text{absorption} = 1\]

This constraint is enforced as a soft penalty during training.
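One possible form of such a soft penalty is the mean squared residual of the conservation equation. The sketch below is an assumption about the penalty's shape, not RTnn's actual loss term; it also assumes that an absorption value is available for each sample.

```python
def conservation_penalty(albedo, transmittance, absorption):
    """Mean squared residual of albedo + transmittance + absorption = 1."""
    residuals = [
        a + t + b - 1.0
        for a, t, b in zip(albedo, transmittance, absorption)
    ]
    return sum(r * r for r in residuals) / len(residuals)


# A sample that conserves energy contributes nothing to the penalty.
print(conservation_penalty([0.2], [0.3], [0.5]))
# A sample whose terms sum to 1.1 is penalized by the squared excess.
print(conservation_penalty([0.2], [0.3], [0.6]))
```

Because the penalty is added to the loss rather than imposed as a hard constraint, predictions can still violate conservation slightly; the penalty only makes violations costly.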

Hyperparameters

Learning Rate

--learning_rate 0.0001

Recommendations:

LSTM/GRU: 1e-3 to 1e-4
Transformer: 1e-4 to 5e-5
VerticalRT: 1e-4 to 5e-5

Batch Size

--batch_size 4
--tbatch 24  # Temporal batch size: 24 for validation, 1 for inference

Note: For VerticalRT with 120 output channels, reduce batch size to 4-8 due to memory constraints.

Loss Functions

--loss_type huber
--beta 0.2  # Weight for absorption loss
--beta_delta 1.0  # For Huber/SmoothL1

Loss Function Options:

Loss Function Types
Loss Type	Best For	Description
`mse`	General purpose	Standard Mean Squared Error
`mae`	Robust to outliers	Mean Absolute Error
`huber`	Balanced	Quadratic (MSE-like) for small errors, linear (MAE-like) for large errors
`smoothl1`	Similar to Huber	Smooth L1 Loss
`nmae`	Scale-invariant	Normalized MAE
`nmse`	Scale-invariant	Normalized MSE
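For reference, the first three loss types can be written out in plain Python using their standard definitions. This is a framework-independent sketch rather than RTnn's code; the `delta` argument plays the role of `--beta_delta`. The normalized variants are omitted because the normalization used by RTnn is not specified here.

```python
def mse(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def huber(pred, target, delta=1.0):
    """Huber loss: quadratic below delta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err * err
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)


pred = [0.0, 0.0]
target = [0.5, 3.0]  # one small error, one outlier
print(mse(pred, target), mae(pred, target), huber(pred, target))
```

With one outlier in the data, the Huber value sits below both MSE and MAE here, which illustrates why it is described as the balanced choice.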

Weighted Loss Formula:

\[\mathcal{L}_{\text{total}} = (1 - \beta) \cdot \mathcal{L}_{\text{fluxes}} + \beta \cdot (\mathcal{L}_{\text{abs12}} + \mathcal{L}_{\text{abs34}})\]

Where:

\(\beta\) controls the trade-off between flux accuracy and absorption accuracy
Typical \(\beta\) values: 0.05 to 0.3
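The weighting can be checked with a small worked example. The function below transcribes the formula directly; the individual loss values are made up for illustration.

```python
def total_loss(loss_fluxes, loss_abs12, loss_abs34, beta=0.2):
    """Weighted total: (1 - beta) * fluxes + beta * (abs12 + abs34)."""
    return (1.0 - beta) * loss_fluxes + beta * (loss_abs12 + loss_abs34)


# Illustrative values only: 0.8 * 0.010 + 0.2 * (0.004 + 0.006)
print(total_loss(0.010, 0.004, 0.006, beta=0.2))
```

Setting `beta=0` recovers the pure flux loss, while larger values shift weight toward the two absorption terms.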

Learning Rate Scheduling

RTnn uses the ReduceLROnPlateau scheduler:

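from torch.optim.lr_scheduler import ReduceLROnPlateau

scheduler = ReduceLROnPlateau(
    optimizer,
    factor=0.5,      # Multiply the LR by 0.5 on each reduction
    patience=5,      # Tolerate 5 epochs without improvement before reducing
    mode='min'       # The monitored validation loss should decrease
)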

The learning rate is reduced when validation loss plateaus.
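To show how `factor` and `patience` interact, the sketch below simulates the reduce-on-plateau rule in plain Python. It is a simplified model of the scheduler's behavior that ignores options such as `threshold` and `cooldown`; the function name and the loss values are hypothetical.

```python
def simulate_plateau_schedule(val_losses, lr=1e-4, factor=0.5, patience=5):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        # Reduce only once the number of bad epochs exceeds the patience.
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# One improving epoch followed by six epochs with no improvement.
history = simulate_plateau_schedule([1.0] * 7)
print(history)
```

In this simplified model the rate is halved on the sixth consecutive epoch without improvement, since the first five are tolerated by `patience=5`.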

Optimizer

RTnn uses the Adam optimizer with default parameters; only the learning rate is set explicitly:

import torch

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=learning_rate   # Value passed via --learning_rate
)

Training Tips

For LSTM/GRU:

Hidden size: 128-256 for 121 inputs
Number of layers: 2-4
Dropout: 0.2-0.4

For Transformer:

Embed size: 256-512
Number of heads: embed_size // 64
Forward expansion: 2-4
Dropout: 0.1-0.3
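The head-count rule corresponds to a fixed dimension of 64 per attention head. A short sketch, with a divisibility check added as an assumption (the helper `num_heads` is not part of RTnn):

```python
def num_heads(embed_size: int, head_dim: int = 64) -> int:
    """Number of attention heads for a given embedding size."""
    if embed_size % head_dim != 0:
        raise ValueError("embed_size should be a multiple of head_dim")
    return embed_size // head_dim


for embed_size in (256, 512):
    print(embed_size, num_heads(embed_size))
```

For the recommended embed sizes of 256 and 512 this gives 4 and 8 heads respectively.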

For VerticalRT:

Hidden size: 256
Layer embedding dimension: 16
Dropout: 0.1

General Tips:

Start with a smaller learning rate (1e-4) for larger models
Use gradient clipping for stable training
Monitor conservation penalty (should approach 0)
Save checkpoints every 10 epochs
Use mixed precision training for memory efficiency
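As background for the gradient clipping tip, the sketch below shows clipping by global norm in plain Python. In PyTorch the same idea is available as `torch.nn.utils.clip_grad_norm_`. The `max_norm` value of 1.0 is an assumption for the example, not an RTnn default.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale is capped at 1 so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Clipping preserves the direction of the gradient and only limits its length, which is why it stabilizes training without changing which way the parameters move.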

See Neural Architectures for architecture-specific details.