Quick Start
This guide will help you get started with RTnn quickly.
Basic Usage
from rtnn import DataPreprocessor, RNN_LSTM
from rtnn.logger import Logger

# Initialize logger
logger = Logger(console_output=True)

# Load your data
# (norm_mapping and normalization_type must be defined beforehand)
dataset = DataPreprocessor(
    logger=logger,
    dfs=["data_1995.nc", "data_1996.nc"],
    stime=0,
    tstep=100,
    tbatch=24,
    norm_mapping=norm_mapping,
    normalization_type=normalization_type,
)

# Create model
model = RNN_LSTM(
    feature_channel=6,
    output_channel=4,
    hidden_size=128,
    num_layers=3,
)

# Train (simplified; num_epochs, dataloader, criterion, and optimizer
# are assumed to be defined elsewhere)
for epoch in range(num_epochs):
    for features, targets in dataloader:
        optimizer.zero_grad()  # clear gradients left over from the previous step
        outputs = model(features)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
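The simplified loop leaves `num_epochs`, `dataloader`, `criterion`, and `optimizer` undefined. The following is a minimal sketch of one way to provide them with standard PyTorch components. It assumes the RTnn model behaves like a `torch.nn.Module`. `TinyLSTM` is a hypothetical stand-in for `RNN_LSTM`, and random tensors stand in for the output of `DataPreprocessor`, because the exact tensor shapes are not specified here. The Huber loss and Adam optimizer are illustrative choices, not necessarily what RTnn uses internally.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class TinyLSTM(nn.Module):
    """Hypothetical stand-in model; replace with rtnn.RNN_LSTM in practice."""

    def __init__(self, feature_channel=6, output_channel=4, hidden_size=16, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(feature_channel, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, output_channel)

    def forward(self, x):
        hidden, _ = self.lstm(x)   # (batch, seq, hidden_size)
        return self.head(hidden)   # (batch, seq, output_channel)


def train(model, dataloader, num_epochs=2, learning_rate=1e-4):
    """Run the simplified loop and return the mean loss of each epoch."""
    criterion = nn.HuberLoss(delta=1.0)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    epoch_losses = []
    for epoch in range(num_epochs):
        running = 0.0
        for features, targets in dataloader:
            optimizer.zero_grad()
            outputs = model(features)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            running += loss.item()
        epoch_losses.append(running / len(dataloader))
    return epoch_losses


torch.manual_seed(42)
# Random placeholder data shaped (samples, seq_length, channels); an assumption.
features = torch.randn(32, 10, 6)
targets = torch.randn(32, 10, 4)
dataloader = DataLoader(TensorDataset(features, targets), batch_size=4, shuffle=True)
epoch_losses = train(TinyLSTM(), dataloader)
print(epoch_losses)
```

With real data, the `DataLoader` would wrap the dataset produced by `DataPreprocessor`, provided that object follows the PyTorch dataset interface.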
Command Line Interface
Train an LSTM model with 121 input features, 256 hidden units, 3 layers, and 120 output channels:
# Collect the options in a Bash array so that each one can carry a comment
# (a comment cannot follow a line-continuation backslash).
args=(
  --root_dir "./"  # Project root directory
  --main_folder "Prod__lstm_h256_l3_d0d1_sb_4_ne_100"  # Main experiment folder
  --sub_folder "nrm_log1p_standard_lr_0d0001_beta_0d5"  # Run-specific subfolder
  --prefix "nrm_log1p_standard_lr_0d0001_beta_0d5"  # Output/checkpoint prefix
  --dataset_type "LSM"  # Dataset type
  --type "lstm"  # Model type
  --hidden_size "256"  # Hidden layer size
  --num_layers "3"  # Number of layers
  --output_channel "120"  # Output feature dimension
  --seq_length "10"  # Input sequence length
  --feature_channel "121"  # Input feature dimension
  --embed_size "256"  # Embedding size (transformer only)
  --nhead "4"  # Number of attention heads (transformer only)
  --forward_expansion "4"  # Feed-forward expansion factor (transformer only)
  --dropout "0.1"  # Dropout rate
  --model_name "lstm_h256_l3_d0d1"  # Model identifier
  --batch_size "4"  # Batch size
  --num_epochs "100"  # Number of training epochs
  --learning_rate "0.0001"  # Learning rate
  --loss_type "huber"  # Loss function
  --beta "0.5"  # Loss weighting parameter
  --beta_delta "1.0"  # Secondary loss scaling factor
  --train_data_files "/path/to/training/data"  # Training NetCDF4 dataset path
  --test_data_files "/path/to/testing/data"  # Testing NetCDF4 dataset path
  --train_years "1995-1999"  # Training time range
  --test_year "2000"  # Test year
  --norm "log1p_standard"  # Normalization method
  --num_workers "4"  # DataLoader worker processes
  --save_model "True"  # Save final model
  --save_checkpoint_name "model"  # Checkpoint filename
  --save_per_samples "10000"  # Save interval (samples)
  --run_type "train"  # Run mode: training
  --seed "42"  # Random seed
  --debug "False"  # Debug mode
)
rtnn "${args[@]}"
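When runs are launched from a Python script, for example to sweep over hyperparameters, the same command line can be assembled programmatically. The sketch below is one possible approach and is not part of RTnn: `build_rtnn_command` is a hypothetical helper, and only a subset of the flags from the command above is shown.

```python
import shlex


def build_rtnn_command(options, executable="rtnn"):
    """Turn a {flag_name: value} mapping into an argv list for subprocess.run."""
    argv = [executable]
    for name, value in options.items():
        argv.extend([f"--{name}", str(value)])
    return argv


# A subset of the options from the command above; add the rest the same way.
options = {
    "type": "lstm",
    "hidden_size": 256,
    "num_layers": 3,
    "feature_channel": 121,
    "output_channel": 120,
    "learning_rate": 0.0001,
    "run_type": "train",
}

command = build_rtnn_command(options)
print(shlex.join(command))  # shell-quoted form, handy for logging

# To actually launch the run (requires rtnn to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.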
Show version:
rtnn --version
Show help:
rtnn --help
Performance Optimization
RTnn is designed to be efficient on single-GPU systems, but performance varies significantly with the SLURM parameters and the number of PyTorch DataLoader workers. Several combinations of these settings were benchmarked, and the following table summarizes the results. In every test, ntasks-per-node × cpus-per-task is at most 4.
| Test | ntasks-per-node | cpus-per-task | num_workers | SBATCH lines | Epoch 0 Time (s) | Epoch 1 Time (s) | Total Time (s) |
|---|---|---|---|---|---|---|---|
| 1 | default (1) | default | 0 | (omit both) | 1179.8 | 742.2 | 1922.0 |
| 2 | default (1) | default | 4 | (omit both) | 333.5 | 280.7 | 614.2 |
| 3 | default (1) | 4 | 0 | #SBATCH --cpus-per-task=4 | 1150.6 | 997.1 | 2147.7 |
| 4 | default (1) | 4 | 4 | #SBATCH --cpus-per-task=4 | 375.5 | 286.8 | 662.3 |
| 5 | 4 | default | 0 | #SBATCH --ntasks-per-node=4 | 1898.0 | (Missing) | ~1898.0+ |
| 6 | 4 | default | 4 | #SBATCH --ntasks-per-node=4 | 389.5 | 283.6 | 673.1 |
| 7 | 2 | 2 | 0 | Both lines | 1114.6 | 753.2 | 1867.8 |
| 8 | 2 | 2 | 4 | Both lines | 338.6 | 275.9 | 614.5 |
Configuration Notes:
default for ntasks-per-node means 1 (system default)
default for cpus-per-task means the parameter is omitted (system default)
Epoch 0 includes initial data loading, cache warm-up, and JIT compilation
Epoch 1 shows steady-state performance after warm-up
Maximum total cores constraint: ntasks-per-node × cpus-per-task ≤ 4
Test 5 is missing epoch 1 because the job timed out (I/O bottleneck)
Performance Analysis and Key Findings:
Dramatic Impact of DataLoader Workers:
Adding num_workers=4 reduces total time by 68% (Test 1→2: 1922s → 614s)
Without workers, every configuration that completed performs poorly (1868-2148s total time)
Workers effectively overlap I/O with GPU computation
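The overlap of loading and computation can be illustrated with a toy sketch that uses only the standard library. It is an analogy and not how RTnn or PyTorch implement data loading: PyTorch DataLoader workers are separate processes, whereas this sketch uses a single background thread. `load_batch`, `compute`, and the sleep delays are hypothetical stand-ins for disk reads and GPU training steps.

```python
import queue
import threading
import time


def load_batch(i, io_delay=0.02):
    """Simulated slow read (stands in for reading a batch from disk)."""
    time.sleep(io_delay)
    return list(range(i * 4, i * 4 + 4))


def compute(batch, compute_delay=0.02):
    """Simulated training step (stands in for the GPU forward/backward pass)."""
    time.sleep(compute_delay)
    return sum(batch)


def run_sequential(num_batches):
    """Load, then compute, one batch at a time (like num_workers=0)."""
    return [compute(load_batch(i)) for i in range(num_batches)]


def run_prefetched(num_batches, prefetch=2):
    """Load batches in a background thread while the main thread computes."""
    batches = queue.Queue(maxsize=prefetch)

    def producer():
        for i in range(num_batches):
            batches.put(load_batch(i))
        batches.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := batches.get()) is not None:
        results.append(compute(batch))
    return results


start = time.perf_counter()
sequential = run_sequential(10)
t_sequential = time.perf_counter() - start

start = time.perf_counter()
prefetched = run_prefetched(10)
t_prefetched = time.perf_counter() - start

print(f"sequential: {t_sequential:.2f}s, prefetched: {t_prefetched:.2f}s")
```

Both variants produce identical results. On an otherwise idle machine the prefetched run should take roughly as long as the slower of the two stages rather than their sum, which is the effect the benchmark attributes to DataLoader workers.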
Optimal Configurations (Test 2 and Test 8):
Test 2 (default/4 workers): 614.2s total, 280.7s epoch 1
Test 8 (ntasks=2, cpus=2, workers=4): 614.5s total, 275.9s epoch 1
Both achieve 3.1x speedup over baseline (Test 1)
Simple configuration (Test 2) is easiest to implement
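The speedup and reduction figures can be recomputed directly from the Total Time column of the table. The sketch below does this for every completed run. The helper names `speedup` and `reduction_percent` are illustrative, and Test 5 is left out because it did not finish.

```python
# Total times in seconds, copied from the table (Test 5 omitted: incomplete run).
total_time = {1: 1922.0, 2: 614.2, 3: 2147.7, 4: 662.3, 6: 673.1, 7: 1867.8, 8: 614.5}
baseline = total_time[1]  # Test 1: default SLURM settings, num_workers=0


def speedup(test):
    """How many times faster a test is than the baseline."""
    return baseline / total_time[test]


def reduction_percent(test):
    """Percentage of baseline time saved (negative means slower than baseline)."""
    return 100.0 * (baseline - total_time[test]) / baseline


# List the configurations from fastest to slowest.
ranking = sorted(total_time, key=total_time.get)
for test in ranking:
    print(
        f"Test {test}: {total_time[test]:7.1f} s, "
        f"speedup {speedup(test):.2f}x, reduction {reduction_percent(test):6.1f}%"
    )
```

Tests 2 and 8 come out on top with nearly identical totals, while Test 3 shows a negative reduction because it was slower than the baseline.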
Poor Configurations to Avoid:
Test 5 (ntasks=4, no workers): 1898s for epoch 0 and no epoch 1, because the job timed out (attributed to I/O saturation)
Tests 3 and 7 (workers=0): 1868-2148s total time; the extra CPU cores are wasted without workers
Adding CPU cores without workers provides no meaningful benefit (Test 1 vs 3 vs 7)
The Workers Effect:
With workers (Tests 2,4,6,8): Epoch 0: 333-390s, Epoch 1: 276-287s
Without workers (Tests 1,3,5,7): Epoch 0: 1115-1898s, Epoch 1: 742-997s
Workers reduce epoch 0 time by 67-80% and epoch 1 time by 62-71% (comparing each test with its counterpart that differs only in num_workers)
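The effect of workers is easiest to see by pairing each run with the one that uses the same SLURM settings and differs only in num_workers. The following sketch recomputes the per-epoch reductions from the table values. The pairing and the helper `reduction_percent` are choices made for this illustration.

```python
# (epoch 0, epoch 1) times in seconds from the table; None marks the missing value.
epochs = {
    1: (1179.8, 742.2), 2: (333.5, 280.7),
    3: (1150.6, 997.1), 4: (375.5, 286.8),
    5: (1898.0, None), 6: (389.5, 283.6),
    7: (1114.6, 753.2), 8: (338.6, 275.9),
}
# Each pair shares the same SLURM settings and differs only in num_workers (0 vs 4).
pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]


def reduction_percent(before, after):
    """Percentage reduction from before to after, or None if a value is missing."""
    if before is None or after is None:
        return None
    return 100.0 * (before - after) / before


reductions = {}
for without, with_workers in pairs:
    reductions[(without, with_workers)] = tuple(
        reduction_percent(epochs[without][k], epochs[with_workers][k]) for k in range(2)
    )

for (without, with_workers), (e0, e1) in reductions.items():
    e1_text = "n/a" if e1 is None else f"{e1:.1f}%"
    print(
        f"Tests {without}->{with_workers}: "
        f"epoch 0 reduced by {e0:.1f}%, epoch 1 reduced by {e1_text}"
    )
```

The Test 5 to Test 6 pair has no epoch 1 value because Test 5 timed out before completing its second epoch.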
CPU Core Allocation Impact:
With workers, CPU core allocation has minimal effect (614-673s total)
Without workers, extra tasks or cores do not help and can hurt: Test 5 needed 1898s for epoch 0 alone (Test 1: 1180s), and Test 3 took 2148s in total (Test 1: 1922s)
This suggests that I/O, not CPU compute, is the primary bottleneck
Best Configuration (Simple and Effective):
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
# In your training script (or pass --num_workers 4 to the rtnn CLI):
num_workers=4
Next Steps
Explore Neural Architectures for different model types
Learn about Training Strategy for optimal training
Check Inference Modes for running predictions
See the RTnn API Reference for detailed API documentation