Quick Start
===========

This guide will help you get started with RTnn quickly.

Basic Usage
-----------

.. code-block:: python

   from rtnn import DataPreprocessor, RNN_LSTM
   from rtnn.logger import Logger

   # Initialize logger
   logger = Logger(console_output=True)

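   # Load your data. norm_mapping and normalization_type are placeholders
   # for your normalization settings; define them before this call.
   dataset = DataPreprocessor(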
       logger=logger,
       dfs=["data_1995.nc", "data_1996.nc"],
       stime=0,
       tstep=100,
       tbatch=24,
       norm_mapping=norm_mapping,
       normalization_type=normalization_type
   )

   # Create model
   model = RNN_LSTM(
       feature_channel=6,
       output_channel=4,
       hidden_size=128,
       num_layers=3
   )

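   # Train (simplified). num_epochs, dataloader, criterion, and optimizer
   # are placeholders that you define for your own setup.
   for epoch in range(num_epochs):
       for features, targets in dataloader:
           optimizer.zero_grad()  # clear gradients from the previous step
           outputs = model(features)
           loss = criterion(outputs, targets)
           loss.backward()
           optimizer.step()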

Command Line Interface
----------------------

Train an LSTM model with 121 input features, 256 hidden units, 3 layers, and 120 output channels:

.. code-block:: bash

   # Bash does not allow a comment after a line-continuation backslash,
   # so the options are collected in an array, where comments are allowed.
   args=(
     --root_dir "./"  # Project root directory
     --main_folder "Prod__lstm_h256_l3_d0d1_sb_4_ne_100"  # Main experiment folder
     --sub_folder "nrm_log1p_standard_lr_0d0001_beta_0d5"  # Run-specific subfolder
     --prefix "nrm_log1p_standard_lr_0d0001_beta_0d5"  # Output/checkpoint prefix
     --dataset_type "LSM"  # Dataset type
     --type "lstm"  # Model type
     --hidden_size "256"  # Hidden layer size
     --num_layers "3"  # Number of layers
     --output_channel "120"  # Output feature dimension
     --seq_length "10"  # Input sequence length
     --feature_channel "121"  # Input feature dimension
     --embed_size "256"  # Embedding size (transformer only)
     --nhead "4"  # Number of attention heads (transformer only)
     --forward_expansion "4"  # Feed-forward expansion factor (transformer only)
     --dropout "0.1"  # Dropout rate
     --model_name "lstm_h256_l3_d0d1"  # Model identifier
     --batch_size "4"  # Batch size
     --num_epochs "100"  # Number of training epochs
     --learning_rate "0.0001"  # Learning rate
     --loss_type "huber"  # Loss function
     --beta "0.5"  # Loss weighting parameter
     --beta_delta "1.0"  # Secondary loss scaling factor
     --train_data_files "/path/to/training/data"  # Training NetCDF4 dataset path
     --test_data_files "/path/to/testing/data"  # Testing NetCDF4 dataset path
     --train_years "1995-1999"  # Training time range
     --test_year "2000"  # Test year
     --norm "log1p_standard"  # Normalization method
     --num_workers "4"  # DataLoader worker processes
     --save_model "True"  # Save final model
     --save_checkpoint_name "model"  # Checkpoint filename
     --save_per_samples "10000"  # Save interval (samples)
     --run_type "train"  # Run mode: training
     --seed "42"  # Random seed
     --debug "False"  # Debug mode
   )
   rtnn "${args[@]}"
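
When launching several runs (for example a small hyperparameter sweep), it can be convenient to assemble the same command from Python. The following is a minimal sketch, not part of RTnn: ``build_rtnn_command`` is a hypothetical helper, the flag names are copied from the example above, and it assumes the ``rtnn`` executable is on your ``PATH``. Only a subset of the options is shown; a real run will likely need the remaining flags from the example, such as the data paths.

.. code-block:: python

   import shlex


   def build_rtnn_command(options):
       """Return the argument list for an ``rtnn`` call (hypothetical helper)."""
       args = ["rtnn"]
       for flag, value in options.items():
           # Each dictionary entry becomes "--flag value"
           args += [f"--{flag}", str(value)]
       return args


   # Flag names taken from the CLI example; values are illustrative
   options = {
       "type": "lstm",
       "hidden_size": 256,
       "num_layers": 3,
       "feature_channel": 121,
       "output_channel": 120,
       "num_workers": 4,
       "run_type": "train",
   }

   command = build_rtnn_command(options)
   print(shlex.join(command))  # shell-quoted form, handy for logging

   # To actually launch the run:
   # import subprocess
   # subprocess.run(command, check=True)

Passing the arguments as a list avoids shell quoting problems, and the printed form can be pasted into a job script.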

Show version:

.. code-block:: bash

   rtnn --version

Show help:

.. code-block:: bash

   rtnn --help

Performance Optimization
------------------------

RTnn is designed to be efficient on single-GPU systems, but performance varies significantly with the SLURM parameters and the number of PyTorch DataLoader workers.
Several combinations of these settings were benchmarked; the following table summarizes the results.
In every test, ``ntasks-per-node × cpus-per-task`` was kept at 4 or below.

.. list-table:: Performance Comparison on HAL Machine
   :header-rows: 1
   :widths: 5 15 15 15 25 15 15 15
   :align: center

   * - Test
     - ntasks-per-node
     - cpus-per-task
     - num_workers
     - SBATCH lines
     - Epoch 0 Time (s)
     - Epoch 1 Time (s)
     - Total Time (s)
   * - 1
     - default (1)
     - default
     - 0
     - (omit both)
     - 1179.8
     - 742.2
     - 1922.0
   * - 2
     - default (1)
     - default
     - 4
     - (omit both)
     - 333.5
     - 280.7
     - 614.2
   * - 3
     - default (1)
     - 4
     - 0
     - #SBATCH --cpus-per-task=4
     - 1150.6
     - 997.1
     - 2147.7
   * - 4
     - default (1)
     - 4
     - 4
     - #SBATCH --cpus-per-task=4
     - 375.5
     - 286.8
     - 662.3
   * - 5
     - 4
     - default
     - 0
     - #SBATCH --ntasks-per-node=4
     - 1898.0
     - (Missing)
     - ~1898.0+
   * - 6
     - 4
     - default
     - 4
     - #SBATCH --ntasks-per-node=4
     - 389.5
     - 283.6
     - 673.1
   * - 7
     - 2
     - 2
     - 0
     - Both lines
     - 1114.6
     - 753.2
     - 1867.8
   * - 8
     - 2
     - 2
     - 4
     - Both lines
     - 338.6
     - 275.9
     - 614.5

**Configuration Notes:**

- ``default`` for ntasks-per-node means 1 (system default)
- ``default`` for cpus-per-task means the parameter is omitted (system default)
- Epoch 0 includes initial data loading, cache warm-up, and JIT compilation
- Epoch 1 shows steady-state performance after warm-up
- Maximum total cores constraint: ``ntasks-per-node × cpus-per-task ≤ 4``
- Test 5 has no epoch 1 time because the job timed out (I/O bottleneck)

**Performance Analysis and Key Findings:**

1. **Dramatic Impact of DataLoader Workers**:

   - Adding ``num_workers=4`` reduces total time by **68%** (Test 1→2: 1922s → 614s)
   - Without workers, all configurations perform poorly (1867-2148s total time)
   - Workers effectively overlap I/O with GPU computation

2. **Optimal Configuration: Test 2 and Test 8**:

   - **Test 2** (default/4 workers): 614.2s total, 280.7s epoch 1
   - **Test 8** (ntasks=2, cpus=2, workers=4): 614.5s total, 275.9s epoch 1
   - Both achieve **3.1x speedup** over baseline (Test 1)
   - Simple configuration (Test 2) is easiest to implement

3. **Poor Configurations to Avoid**:

   - **Test 5** (ntasks=4, no workers): 1898s epoch 0, incomplete - I/O saturation causes timeout
   - **Test 3 & 7** (workers=0): 1867-2148s total time - CPU cores wasted without workers
   - Adding CPU cores without workers provides **no benefit** (Test 1 vs 3 vs 7)

4. **The Workers Effect**:

   - With workers (Tests 2,4,6,8): Epoch 0: 333-390s, Epoch 1: 276-287s
   - Without workers (Tests 1,3,5,7): Epoch 0: 1115-1898s, Epoch 1: 742-997s
   - Comparing each configuration with and without workers, workers reduce epoch 0 time by **67-80%** and epoch 1 time by **62-71%**

5. **CPU Core Allocation Impact**:

   - With workers, CPU core allocation has minimal effect (614-673s total)
   - Without workers, more cores can hurt performance (Test 5 epoch 0: 1898.0s vs Test 1 epoch 0: 1179.8s; Test 3 total: 2147.7s vs Test 1 total: 1922.0s)
   - Suggests I/O is the primary bottleneck, not CPU compute
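
The percentages and speedups quoted in this analysis can be recomputed from the table. The short script below is a sketch for checking the arithmetic: the timings are copied from the table, each configuration is paired with its ``num_workers=4`` counterpart, and Test 5 and Test 6 are left out because Test 5 has no epoch 1 time. The helper name ``summarize`` is only illustrative.

.. code-block:: python

   # (epoch 0, epoch 1) wall-clock times in seconds, copied from the table
   times = {
       1: (1179.8, 742.2),
       2: (333.5, 280.7),
       3: (1150.6, 997.1),
       4: (375.5, 286.8),
       7: (1114.6, 753.2),
       8: (338.6, 275.9),
   }

   # Same SLURM settings: (num_workers=0 test, num_workers=4 test)
   pairs = [(1, 2), (3, 4), (7, 8)]


   def summarize(base, tuned):
       """Return (percent reduction in total time, speedup factor)."""
       total_base = sum(times[base])
       total_tuned = sum(times[tuned])
       reduction = 100.0 * (total_base - total_tuned) / total_base
       speedup = total_base / total_tuned
       return reduction, speedup


   for base, tuned in pairs:
       reduction, speedup = summarize(base, tuned)
       print(
           f"Test {base} -> Test {tuned}: "
           f"{reduction:.0f}% less total time, {speedup:.1f}x speedup"
       )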


**Best Configuration (Simple and Effective):**

.. code-block:: bash

   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=1

   # In your training script:
   num_workers=4
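
Because these results come from one machine, it is worth repeating the comparison on your own system. The following is a minimal sketch of a per-epoch timing harness, not part of RTnn: ``time_epochs`` and ``dummy_epoch`` are hypothetical names, and ``dummy_epoch`` stands in for one full pass over your DataLoader.

.. code-block:: python

   import time


   def time_epochs(run_epoch, num_epochs=2):
       """Call run_epoch(epoch) once per epoch and return the seconds each took."""
       durations = []
       for epoch in range(num_epochs):
           start = time.perf_counter()
           run_epoch(epoch)
           durations.append(time.perf_counter() - start)
       return durations


   def dummy_epoch(epoch):
       # Placeholder workload; replace with one pass over your DataLoader
       time.sleep(0.01)


   durations = time_epochs(dummy_epoch)
   for epoch, seconds in enumerate(durations):
       print(f"Epoch {epoch} time: {seconds:.3f} s")
   print(f"Total time: {sum(durations):.3f} s")

Timing epoch 0 and epoch 1 separately, as in the table, keeps the one-time warm-up cost apart from steady-state throughput. When the epoch runs on a GPU, calling ``torch.cuda.synchronize()`` at the end of ``run_epoch`` helps ensure queued GPU work is included in the measurement.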

Next Steps
----------

- Explore :doc:`neural_architectures` for different model types
- Learn about :doc:`training_strategy` for optimal training
- Check :doc:`inference_modes` for running predictions
- See :doc:`api/modules` for detailed API reference