Quick Start

This guide will help you get started with RTnn quickly.

Basic Usage

from rtnn import DataPreprocessor, RNN_LSTM
from rtnn.logger import Logger

# Initialize logger
logger = Logger(console_output=True)

# Load your data (norm_mapping and normalization_type must be defined beforehand)
dataset = DataPreprocessor(
    logger=logger,
    dfs=["data_1995.nc", "data_1996.nc"],
    stime=0,
    tstep=100,
    tbatch=24,
    norm_mapping=norm_mapping,
    normalization_type=normalization_type
)

# Create model
model = RNN_LSTM(
    feature_channel=6,
    output_channel=4,
    hidden_size=128,
    num_layers=3
)

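# Train (simplified; dataloader, criterion, optimizer, and num_epochs
# are assumed to be defined elsewhere)
for epoch in range(num_epochs):
    for features, targets in dataloader:
        optimizer.zero_grad()  # reset gradients left over from the previous step
        outputs = model(features)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()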
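
The loop above relies on `dataloader`, `criterion`, `optimizer`, and `num_epochs` being defined elsewhere. As a minimal, self-contained sketch of one way to wire them up, the block below uses plain PyTorch with synthetic tensors. `StandInLSTM` is a hypothetical stand-in and not RTnn's `RNN_LSTM`; the tensor layout (batch, sequence, channels), the Huber loss, and the Adam optimizer are assumptions chosen for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)


class StandInLSTM(nn.Module):
    """Hypothetical stand-in for RNN_LSTM: an LSTM followed by a linear head."""

    def __init__(self, feature_channel=6, output_channel=4, hidden_size=16, num_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(feature_channel, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, output_channel)

    def forward(self, x):
        hidden, _ = self.lstm(x)   # (batch, seq, hidden_size)
        return self.head(hidden)   # (batch, seq, output_channel)


# Synthetic data: 32 samples, sequence length 10, 6 input and 4 output channels.
features = torch.randn(32, 10, 6)
targets = torch.randn(32, 10, 4)
dataloader = DataLoader(TensorDataset(features, targets), batch_size=4, shuffle=True)

model = StandInLSTM()
criterion = nn.HuberLoss(delta=1.0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
num_epochs = 2

epoch_losses = []
for epoch in range(num_epochs):
    running = 0.0
    for batch_features, batch_targets in dataloader:
        optimizer.zero_grad()  # clear gradients from the previous step
        outputs = model(batch_features)
        loss = criterion(outputs, batch_targets)
        loss.backward()
        optimizer.step()
        running += loss.item()
    epoch_losses.append(running / len(dataloader))
    print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```

With real data, the synthetic `TensorDataset` would be replaced by the dataset built in the previous step, provided it can be wrapped in a `DataLoader`; this guide does not confirm that.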

Command Line Interface

Train an LSTM model with 121 input features, 256 hidden units, 3 layers, and 120 output channels:

# Collect the options in a bash array so that each one can carry a comment.
# (A comment placed after a trailing backslash breaks the line continuation.)
args=(
  --root_dir "./"  # Project root directory
  --main_folder "Prod__lstm_h256_l3_d0d1_sb_4_ne_100"  # Main experiment folder
  --sub_folder "nrm_log1p_standard_lr_0d0001_beta_0d5"  # Run-specific subfolder
  --prefix "nrm_log1p_standard_lr_0d0001_beta_0d5"  # Output/checkpoint prefix
  --dataset_type "LSM"  # Dataset type
  --type "lstm"  # Model type
  --hidden_size "256"  # Hidden layer size
  --num_layers "3"  # Number of layers
  --output_channel "120"  # Output feature dimension
  --seq_length "10"  # Input sequence length
  --feature_channel "121"  # Input feature dimension
  --embed_size "256"  # Embedding size (transformer only)
  --nhead "4"  # Number of attention heads (transformer only)
  --forward_expansion "4"  # Feed-forward expansion factor (transformer only)
  --dropout "0.1"  # Dropout rate
  --model_name "lstm_h256_l3_d0d1"  # Model identifier
  --batch_size "4"  # Batch size
  --num_epochs "100"  # Number of training epochs
  --learning_rate "0.0001"  # Learning rate
  --loss_type "huber"  # Loss function
  --beta "0.5"  # Loss weighting parameter
  --beta_delta "1.0"  # Secondary loss scaling factor
  --train_data_files "/path/to/training/data"  # Training NetCDF4 dataset path
  --test_data_files "/path/to/testing/data"  # Testing NetCDF4 dataset path
  --train_years "1995-1999"  # Training time range
  --test_year "2000"  # Test year
  --norm "log1p_standard"  # Normalization method
  --num_workers "4"  # DataLoader worker processes
  --save_model "True"  # Save final model
  --save_checkpoint_name "model"  # Checkpoint filename
  --save_per_samples "10000"  # Save interval (samples)
  --run_type "train"  # Run mode: training
  --seed "42"  # Random seed
  --debug "False"  # Debug mode
)

rtnn "${args[@]}"
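
When several runs differ only in a few options, it can be convenient to assemble the command from Python. The sketch below builds an argument list using only flags that appear in the command above. `build_rtnn_command` is a hypothetical helper and not part of RTnn, and whether `rtnn` accepts a subset of the flags is not confirmed by this guide; the block only assembles and prints the command.

```python
import shlex


def build_rtnn_command(options):
    """Turn a {flag_name: value} mapping into an argument list for the rtnn CLI."""
    command = ["rtnn"]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


# A subset of the flags shown above; the remaining ones follow the same pattern.
options = {
    "type": "lstm",
    "hidden_size": 256,
    "num_layers": 3,
    "feature_channel": 121,
    "output_channel": 120,
    "num_workers": 4,
    "run_type": "train",
}

command = build_rtnn_command(options)
print(shlex.join(command))
# To launch the run: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting issues, since each flag and value is handed over as a separate argument.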

Show version:

rtnn --version

Show help:

rtnn --help

Performance Optimization

RTnn is designed to run efficiently on single-GPU systems, but performance varies significantly with the SLURM parameters and the number of PyTorch DataLoader workers. Several combinations of these settings were tested, and the following table summarizes the results. In every test, ntasks-per-node × cpus-per-task is at most 4.

Performance Comparison on HAL Machine

Test  ntasks-per-node  cpus-per-task  num_workers  SBATCH lines                 Epoch 0 Time (s)  Epoch 1 Time (s)  Total Time (s)
----  ---------------  -------------  -----------  ---------------------------  ----------------  ----------------  --------------
1     default (1)      default        0            (omit both)                  1179.8            742.2             1922.0
2     default (1)      default        4            (omit both)                  333.5             280.7             614.2
3     default (1)      4              0            #SBATCH --cpus-per-task=4    1150.6            997.1             2147.7
4     default (1)      4              4            #SBATCH --cpus-per-task=4    375.5             286.8             662.3
5     4                default        0            #SBATCH --ntasks-per-node=4  1898.0            (Missing)         ~1898.0+
6     4                default        4            #SBATCH --ntasks-per-node=4  389.5             283.6             673.1
7     2                2              0            Both lines                   1114.6            753.2             1867.8
8     2                2              4            Both lines                   338.6             275.9             614.5

Configuration Notes:

  • default for ntasks-per-node means 1 (system default)

  • default for cpus-per-task means the parameter is omitted (system default)

  • Epoch 0 includes initial data loading, cache warm-up, and JIT compilation

  • Epoch 1 shows steady-state performance after warm-up

  • Maximum total cores constraint: ntasks-per-node × cpus-per-task ≤ 4

  • Test 5 missing epoch 1 due to job timeout (I/O bottleneck)
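
The table does not say how the epoch times were measured. One plausible approach, sketched here with the standard library only, is to wrap each epoch in a wall-clock timer. `time_epochs` and `dummy_epoch` are hypothetical names; the dummy epoch simply sleeps, with epoch 0 made slower to mimic warm-up.

```python
import time


def time_epochs(run_epoch, num_epochs):
    """Call run_epoch(epoch) num_epochs times and return wall-clock seconds per epoch."""
    durations = []
    for epoch in range(num_epochs):
        start = time.perf_counter()
        run_epoch(epoch)
        durations.append(time.perf_counter() - start)
    return durations


def dummy_epoch(epoch):
    # Stand-in for a real training epoch; epoch 0 sleeps longer to mimic warm-up.
    time.sleep(0.02 if epoch == 0 else 0.01)


durations = time_epochs(dummy_epoch, num_epochs=2)
for epoch, seconds in enumerate(durations):
    print(f"Epoch {epoch} time: {seconds:.3f} s")
print(f"Total time: {sum(durations):.3f} s")
```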

Performance Analysis and Key Findings:

  1. Dramatic Impact of DataLoader Workers:

    • Adding num_workers=4 reduces total time by 68% (Test 1→2: 1922s → 614s)

    • Without workers, all configurations perform poorly (1867-2148s total time for the runs that completed)

    • Workers effectively overlap I/O with GPU computation

  2. Optimal Configurations (Test 2 and Test 8):

    • Test 2 (default/4 workers): 614.2s total, 280.7s epoch 1

    • Test 8 (ntasks=2, cpus=2, workers=4): 614.5s total, 275.9s epoch 1

    • Both achieve 3.1x speedup over baseline (Test 1)

    • Simple configuration (Test 2) is easiest to implement

  3. Poor Configurations to Avoid:

    • Test 5 (ntasks=4, no workers): 1898s epoch 0, incomplete - I/O saturation causes timeout

    • Tests 3 and 7 (workers=0): 1867-2148s total time - CPU cores wasted without workers

    • Adding CPU cores without workers provides no benefit (Test 1 vs 3 vs 7)

  4. The Workers Effect:

    • With workers (Tests 2,4,6,8): Epoch 0: 333-390s, Epoch 1: 276-287s

    • Without workers (Tests 1,3,5,7): Epoch 0: 1115-1898s, Epoch 1: 742-997s

    • Comparing tests with the same SLURM settings, workers reduce epoch 0 time by 67-80% and epoch 1 time by 62-71%

  5. CPU Core Allocation Impact:

    • With workers, CPU core allocation has minimal effect (614-673s total)

    • Without workers, requesting more tasks or cores can actually hurt performance (Test 5 epoch 0: 1898.0s vs Test 1 epoch 0: 1179.8s; Test 3 total: 2147.7s vs Test 1 total: 1922.0s)

    • Suggests I/O is the primary bottleneck, not CPU compute
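
The speedup and reduction figures quoted in this analysis follow directly from the table. As a worked check, the sketch below recomputes them from the measured times; the helper name `reduction_percent` is introduced here for illustration only.

```python
# Times in seconds, copied from the table above.
# Test 5 has no total because its run did not complete.
total_time = {1: 1922.0, 2: 614.2, 3: 2147.7, 4: 662.3, 6: 673.1, 7: 1867.8, 8: 614.5}
epoch0_time = {1: 1179.8, 2: 333.5, 3: 1150.6, 4: 375.5,
               5: 1898.0, 6: 389.5, 7: 1114.6, 8: 338.6}


def reduction_percent(before, after):
    """Relative time saved when going from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Test 1 (no workers) versus Test 2 (4 workers), same SLURM settings.
speedup = total_time[1] / total_time[2]
total_reduction = reduction_percent(total_time[1], total_time[2])
print(f"Test 1 -> 2: {speedup:.2f}x speedup, {total_reduction:.1f}% less total time")

# Each pair shares its SLURM settings and differs only in num_workers (0 versus 4).
pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]
epoch0_reductions = {
    pair: reduction_percent(epoch0_time[pair[0]], epoch0_time[pair[1]]) for pair in pairs
}
for (without, with_workers), percent in epoch0_reductions.items():
    print(f"Test {without} -> {with_workers}: epoch 0 time reduced by {percent:.1f}%")
```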

Best Configuration (Simple and Effective):

#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1

# In your training script (or via --num_workers on the command line):
num_workers=4
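
In a PyTorch script, `num_workers` is an argument of `torch.utils.data.DataLoader`. The sketch below shows where the setting goes. `build_loader` is a hypothetical helper, the synthetic `TensorDataset` stands in for the real dataset, and whether RTnn constructs its loader this way is an assumption.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(dataset, batch_size=4, num_workers=4):
    """Create a DataLoader whose worker processes load batches in the background."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        # Keep workers alive between epochs; only valid when num_workers > 0.
        persistent_workers=num_workers > 0,
    )


# Synthetic stand-in data: 32 samples, sequence length 10, 6 input and 4 output channels.
dataset = TensorDataset(torch.randn(32, 10, 6), torch.randn(32, 10, 4))
loader = build_loader(dataset)
print(f"{len(loader)} batches per epoch, {loader.num_workers} workers")
```

Worker processes start only when the loader is iterated. On platforms that spawn rather than fork processes, the iteration should sit under an `if __name__ == "__main__":` guard.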

Next Steps