Training Basics¶
Master the fundamentals of training AI models.
What is Training?¶
Training is the process of teaching your agent to recognize patterns:
Raw Model            Training               Trained Model
(Random)        →    (Learn from data)  →   (Knows patterns)

❌ Predictions        Process data,          ✅ Accurate
   are random         adjust weights           predictions
The Training Loop¶
┌──────────────────────┐
│ 1. Load batch of │
│ training data │
└──────┬───────────────┘
↓
┌──────────────────────┐
│ 2. Forward pass: │
│ Make prediction │
└──────┬───────────────┘
↓
┌──────────────────────┐
│ 3. Calculate loss: │
│ How wrong? │
└──────┬───────────────┘
↓
┌──────────────────────┐
│ 4. Backward pass: │
│ Adjust weights │
└──────┬───────────────┘
↓
┌──────────────────────┐
│ 5. Repeat next │
│ batch (epoch) │
└──────┬───────────────┘
↓
Done?
/ \
NO YES
↓ ↓
Loop Finished
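The five steps above can be sketched in plain Python. This is a minimal illustration of gradient descent on a single weight, not the IOValence API; the data, loss, and learning rate are made up for the example.

```python
# Minimal training loop: fit one weight w so that w * x ≈ y.
# Illustrative only -- not the IOValence API.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs; true rule is y = 2x
w = 0.0               # start "random" (here: zero) -- predictions are wrong
learning_rate = 0.05

for epoch in range(100):                     # 5. repeat over the data (epochs)
    for x, y in data:                        # 1. load the next example (a batch of 1)
        prediction = w * x                   # 2. forward pass: make a prediction
        loss = (prediction - y) ** 2         # 3. loss: how wrong is it?
        gradient = 2 * (prediction - y) * x  # 4. backward pass: d(loss)/d(w)
        w -= learning_rate * gradient        #    adjust the weight

print(round(w, 3))  # → 2.0
```

After training, `w` has moved from 0 to the slope that actually generates the data.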
Training Hyperparameters¶
Epochs¶
Definition: Number of times the model sees all training data
Epoch 1: Process all 1000 examples
Epoch 2: Process all 1000 examples again
Epoch 3: Process all 1000 examples again
...
Epoch 20: Done!
Effect:

- Few epochs → model underfits
- Many epochs → risk of overfitting
- Sweet spot: 10-50 for most tasks
How to choose:
```yaml
training:
  epochs: 20
  early_stopping: true  # Stop if validation stops improving
  patience: 5
```
Batch Size¶
Definition: Number of examples processed before updating weights
Total data: 1000 examples
Batch size: 32
Batches per epoch: 1000 / 32 = 31.25 → 32 batches (31 full batches plus one final batch of 8 examples)
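The batch count rounds up because the leftover examples still form a (smaller) final batch. A quick check in Python:

```python
import math

total_examples = 1000
batch_size = 32

# Round up: the leftover examples still need one more (partial) batch.
batches_per_epoch = math.ceil(total_examples / batch_size)
last_batch_size = total_examples - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch)  # → 32
print(last_batch_size)    # → 8 (the final, partial batch)
```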
Effect:

- Larger batch → faster epochs and smoother gradient estimates, but fewer weight updates and sometimes worse generalization
- Smaller batch → slower, noisier updates; the noise can act as a mild regularizer
- Sweet spot: 32-128
Memory vs Speed Trade-off:

| Batch Size | Memory    | Speed     | Stability |
|------------|-----------|-----------|-----------|
| 32         | Low       | Slow      | High      |
| 64         | Medium    | Medium    | Medium    |
| 128        | High      | Fast      | Lower     |
| 256        | Very High | Very Fast | Low       |
How to choose:

- GPU available: use 128-256
- Limited memory: use 32-64
- Large dataset: can use larger batches
Learning Rate¶
Definition: How much to adjust weights each step
Too Low (0.00001):
Epoch 1: loss = 0.5
Epoch 2: loss = 0.499
Epoch 100: loss = 0.45
(Very slow improvement)
Good (0.001):
Epoch 1: loss = 0.5
Epoch 2: loss = 0.3
Epoch 5: loss = 0.1
(Steady improvement)
Too High (0.1):
Epoch 1: loss = 0.5
Epoch 2: loss = 1.2
Epoch 3: loss = 3.5
(Diverges! Gets worse)
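These three regimes are easy to reproduce on a toy problem. The sketch below (plain Python with a made-up loss L(w) = (w − 1)², not IOValence code) runs five gradient steps at each rate:

```python
def run(lr, steps=5, w=0.0):
    """Gradient descent on the toy loss L(w) = (w - 1)^2, starting at w = 0."""
    losses = []
    for _ in range(steps):
        losses.append((w - 1) ** 2)
        w -= lr * 2 * (w - 1)   # gradient of (w - 1)^2 is 2(w - 1)
    return losses

print([round(l, 3) for l in run(0.00001)])  # too low: loss barely moves
print([round(l, 3) for l in run(0.3)])      # good: loss shrinks steadily
print([round(l, 3) for l in run(1.5)])      # too high: loss grows -- diverges
```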
How to choose:
```yaml
training:
  optimizer: adam
  learning_rate: 0.001  # Good starting point
  # Or use learning rate scheduling
  scheduler: cosine_annealing
```
Learning Rate Schedule:
Large rate early (fast learning)
↓
Decrease over time (fine-tuning)
↓
Small final rate (stable)
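One common way to implement that shape is cosine annealing, which glides from a large initial rate to a small final one. A self-contained sketch (the rates and epoch count are example values, not IOValence defaults):

```python
import math

initial_lr = 0.001    # large early: fast learning
final_lr = 0.00001    # small late: stable fine-tuning
total_epochs = 30

def cosine_annealed_lr(epoch):
    """Cosine annealing from initial_lr down to final_lr over total_epochs."""
    progress = epoch / total_epochs
    return final_lr + 0.5 * (initial_lr - final_lr) * (1 + math.cos(math.pi * progress))

print(round(cosine_annealed_lr(0), 6))   # → 0.001 (start)
print(round(cosine_annealed_lr(15), 6))  # → 0.000505 (halfway)
print(round(cosine_annealed_lr(30), 6))  # → 1e-05 (end)
```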
Validation & Test Sets¶
Data Split¶
Total Data (100%)
│
├─ Training Set (70%) ← Model learns from this
├─ Validation Set (15%) ← Check during training
└─ Test Set (15%) ← Final evaluation
Why Split?¶
Without split:
Model remembers training data
Training accuracy: 99%
Real accuracy: 60% ❌
With split:
Model learns patterns
Training accuracy: 90%
Test accuracy: 87% ✅ Generalizes well
How to Split¶
```yaml
data:
  train_file: train.csv
  validation_split: 0.2  # 20% for validation
  test_size: 0.2         # 20% for testing
  # Results in: 60% train, 20% val, 20% test
```
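The split itself is just a shuffle followed by slicing. A framework-agnostic sketch (the function name and fractions are illustrative, not IOValence API):

```python
import random

def split_data(examples, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle, then slice into train / validation / test sets.
    Whatever remains after train and validation (here 20%) is the test set."""
    rng = random.Random(seed)      # fixed seed → reproducible split
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_data(range(100))
print(len(train), len(val), len(test))  # → 60 20 20
```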
Monitoring Training¶
Key Metrics to Watch¶
Training Loss: Should decrease
Training Accuracy: Should increase
Validation Loss: Should decrease (but smoother than training)
Validation Accuracy: Should increase
Good pattern:
Loss: ↘ ↘ ↘ (decreasing)
Accuracy: ↗ ↗ ↗ (increasing)
Bad pattern - Overfitting:
Training Loss: ↘ (decreasing)
Validation Loss: ↗ (increasing!) ❌
Bad pattern - Underfitting:
Both losses: ↘ but plateau too high ❌
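The overfitting pattern is exactly what early stopping watches for: validation loss that stops improving. A minimal sketch of the patience logic (plain Python; `patience` and `min_delta` mirror the config keys used elsewhere on this page, but this is not IOValence's implementation):

```python
def should_stop(val_losses, patience=5, min_delta=0.001):
    """True once validation loss has failed to improve by at least
    min_delta for `patience` consecutive epochs."""
    best = float("inf")
    bad_epochs = 0
    for loss in val_losses:
        if loss < best - min_delta:   # meaningful improvement
            best = loss
            bad_epochs = 0
        else:                         # plateau, or getting worse
            bad_epochs += 1
            if bad_epochs >= patience:
                return True
    return False

healthy = [0.9, 0.7, 0.5, 0.4, 0.35, 0.30]               # still improving
overfit = [0.9, 0.5, 0.4, 0.41, 0.42, 0.45, 0.50, 0.55]  # rising again
print(should_stop(healthy, patience=3))  # → False
print(should_stop(overfit, patience=3))  # → True
```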
Training Curves¶
Normal Training:
Accuracy
| ___
| /
| /
|__/____________
├─────────────── Epoch
Overfitting:
Accuracy
| /╲
| / ╲___
| / Val
|/Train
├─────────────── Epoch
Underfitting:
Accuracy
| ___________
| /
|/
├─────────────── Epoch
Learning Rate Strategies¶
Fixed Learning Rate¶
```yaml
learning_rate: 0.001
# Same throughout training
# Simple but suboptimal
```
Decreasing Learning Rate¶
```yaml
learning_rate: 0.001
scheduler: step
step_size: 10
gamma: 0.1
# Divide by 10 every 10 epochs
```
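The resulting schedule is a simple function of the epoch number. Sketched in Python with the same values as the config above:

```python
initial_lr = 0.001
step_size = 10   # epochs between drops
gamma = 0.1      # factor applied at each drop

def step_lr(epoch):
    """Step schedule: the rate drops by a factor of 10 every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)

for epoch in (0, 9, 10, 19, 20):
    print(epoch, step_lr(epoch))  # 0.001 through epoch 9, then 0.0001, then 1e-05
```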
Adaptive Learning Rate (Adam)¶
```yaml
optimizer: adam
learning_rate: 0.001
# Automatically adjusts per parameter
# Recommended for most cases
```
Optimization Algorithms¶
SGD (Stochastic Gradient Descent)¶
✓ Simple and reliable
✗ Can get stuck in local minima
✗ Slow convergence
Use: When simplicity and well-understood behavior matter
Momentum¶
✓ Faster convergence than SGD
✓ Better handling of local minima
✗ More complex
Use: Most traditional ML tasks
Adam (Adaptive Moment Estimation)¶
✓ Fast convergence
✓ Handles different learning rates per parameter
✓ Works well for deep learning
✗ More memory usage
Use: Recommended for IOValence (default)
RMSprop¶
✓ Good for RNNs
✓ Memory efficient
✗ More hyperparameters to tune
Use: When training RNNs
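To make the differences concrete, the update rules for SGD with momentum and for Adam can be run side by side on a one-parameter toy loss, L(w) = (w − 3)². This is a plain-Python illustration with made-up hyperparameters, not IOValence's optimizer code:

```python
import math

def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2 * (w - 3)

# SGD with momentum: a velocity term accumulates past gradients,
# which helps the update roll through shallow local minima.
w, velocity = 0.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(300):
    velocity = beta * velocity + grad(w)
    w -= lr * velocity
sgd_result = w

# Adam: keeps per-parameter estimates of the gradient mean (m) and
# uncentered variance (v), giving each parameter its own effective rate.
w, m, v = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 301):
    g = grad(w)
    m = b1 * m + (1 - b1) * g        # first-moment estimate
    v = b2 * v + (1 - b2) * g * g    # second-moment estimate
    m_hat = m / (1 - b1 ** t)        # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
adam_result = w

print(round(sgd_result, 2), round(adam_result, 2))  # both approach 3.0
```

Both optimizers drive the weight toward the minimum at 3; the interesting difference is *how* the step size behaves along the way, which is what the trade-off lists above describe.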
Training Configuration Template¶
```yaml
training:
  # Model architecture
  model_type: neural_network
  layers: [128, 64, 32]
  activation: relu

  # Training parameters
  epochs: 30
  batch_size: 32
  validation_split: 0.2

  # Optimization
  optimizer: adam
  learning_rate: 0.001
  scheduler: cosine_annealing

  # Regularization (prevent overfitting)
  dropout: 0.3
  l2_regularization: 0.0001

  # Early stopping
  early_stopping: true
  patience: 5
  min_delta: 0.001

  # Checkpointing
  save_checkpoint: true
  checkpoint_interval: 5

  # Debugging
  verbose: true
  seed: 42  # Reproducible results
```
Common Training Scenarios¶
Scenario 1: Quick Prototype¶
```yaml
training:
  epochs: 5
  batch_size: 64
  learning_rate: 0.01
  # Fast but lower accuracy
```
Scenario 2: Balanced¶
```yaml
training:
  epochs: 20
  batch_size: 32
  learning_rate: 0.001
  early_stopping: true
  # Good balance
```
Scenario 3: Production Ready¶
```yaml
training:
  epochs: 100
  batch_size: 16
  learning_rate: 0.001
  scheduler: cosine_annealing
  early_stopping: true
  patience: 10
  dropout: 0.3
  # Best quality, slower
```
Scenario 4: Limited Data¶
```yaml
training:
  epochs: 50
  batch_size: 16        # Smaller batches with small data
  learning_rate: 0.0001 # Very small
  dropout: 0.5          # Heavy regularization
  l2_regularization: 0.001
  # Prevent overfitting with little data
```
Troubleshooting Training¶
Model not learning (flat loss)¶
```yaml
# Increase learning rate
learning_rate: 0.01

# Try a more complex model
layers: [256, 128, 64]

# Check data quality
# (Is the data labeled correctly?)
```
Training too slow¶
```yaml
# Increase batch size
batch_size: 128

# Use fewer layers
layers: [64, 32]

# Use a GPU
# (pip install iovalence[gpu])
```
Overfitting (train good, test bad)¶
```yaml
# More regularization
dropout: 0.5
l2_regularization: 0.001

# More data
# (Collect more examples)

# Simpler model
layers: [32, 16]

# Early stopping
early_stopping: true
patience: 3
```
Training Best Practices¶
- Always use a validation set
  - Monitor validation metrics
  - Stop if they are not improving
- Start simple
  - A simple model gives a good baseline
  - Add complexity only if needed
- Check your data
  - Remove duplicates
  - Fix labels
  - Ensure class balance
- Use reproducible settings
  - `seed: 42` fixes the random seed
- Save checkpoints
  - `save_checkpoint: true`
- Document your settings
  - Include why you chose each value, e.g. `comments: "Testing learning rate 0.01 to improve convergence"`
Next Steps¶
- 🔍 Evaluate Performance - Measure results
- 📊 Prepare Data - Better data = better results
- 🤖 Create Agent - Put it to use
Ready to train your first model? Get Started →