Training Basics¶
Master the fundamentals of training AI models.
What is Training?¶
Training is the process of teaching your agent to recognize patterns:
Raw Model            Training               Trained Model
(Random)        →    (Learn from data)  →   (Knows patterns)

❌ Predictions        Process data,          ✅ Accurate
   are random         adjust weights           predictions
The Training Loop¶
┌──────────────────────┐
│ 1. Load batch of │
│ training data │
└──────┬───────────────┘
↓
┌──────────────────────┐
│ 2. Forward pass: │
│ Make prediction │
└──────┬───────────────┘
↓
┌──────────────────────┐
│ 3. Calculate loss: │
│ How wrong? │
└──────┬───────────────┘
↓
┌──────────────────────┐
│ 4. Backward pass: │
│ Adjust weights │
└──────┬───────────────┘
↓
┌──────────────────────┐
│ 5. Repeat next │
│ batch (epoch) │
└──────┬───────────────┘
↓
Done?
/ \
NO YES
↓ ↓
Loop Finished
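The five steps above can be sketched in plain Python. This is a minimal illustration of gradient descent on a single weight, not the IOValence API; the data, loss, and learning rate are made up for the example.

```python
# Minimal training loop: fit one weight w so that w * x ≈ y.
# Illustrative only -- not the IOValence API.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs; true rule is y = 2x
w = 0.0               # start "random" (here: zero) -- predictions are wrong
learning_rate = 0.05

for epoch in range(100):                     # 5. repeat over the data (epochs)
    for x, y in data:                        # 1. load the next example (a batch of 1)
        prediction = w * x                   # 2. forward pass: make a prediction
        loss = (prediction - y) ** 2         # 3. loss: how wrong is it?
        gradient = 2 * (prediction - y) * x  # 4. backward pass: d(loss)/d(w)
        w -= learning_rate * gradient        #    adjust the weight

print(round(w, 3))  # → 2.0
```

After training, `w` has moved from 0 to the slope that actually generates the data.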
Training Hyperparameters¶
Epochs¶
Definition: Number of times the model sees all training data
Epoch 1: Process all 1000 examples
Epoch 2: Process all 1000 examples again
Epoch 3: Process all 1000 examples again
...
Epoch 20: Done!
Effect:

- Few epochs → model underfits
- Many epochs → risk of overfitting
- Sweet spot: 10-50 for most tasks
How to choose:
```yaml
training:
  epochs: 20
  early_stopping: true  # Stop if validation stops improving
  patience: 5
```
Batch Size¶
Definition: Number of examples processed before updating weights
Total data: 1000 examples
Batch size: 32
Batches per epoch: 1000 / 32 = 31.25 → 32 batches (31 full batches plus one final batch of 8 examples)
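The batch count rounds up because the leftover examples still form a (smaller) final batch. A quick check in Python:

```python
import math

total_examples = 1000
batch_size = 32

# Round up: the leftover examples still need one more (partial) batch.
batches_per_epoch = math.ceil(total_examples / batch_size)
last_batch_size = total_examples - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch)  # → 32
print(last_batch_size)    # → 8 (the final, partial batch)
```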
Effect:

- Larger batch → faster epochs and smoother gradient estimates, but fewer weight updates and sometimes worse generalization
- Smaller batch → slower, noisier updates; the noise can act as a mild regularizer
- Sweet spot: 32-128
Memory vs Speed Trade-off:

| Batch Size | Memory    | Speed     | Stability |
|------------|-----------|-----------|-----------|
| 32         | Low       | Slow      | High      |
| 64         | Medium    | Medium    | Medium    |
| 128        | High      | Fast      | Lower     |
| 256        | Very High | Very Fast | Low       |
How to choose:

- GPU available: use 128-256
- Limited memory: use 32-64
- Large dataset: can use larger batches
Learning Rate¶
Definition: How much to adjust weights each step
Too Low (0.00001):
Epoch 1: loss = 0.5
Epoch 2: loss = 0.499
Epoch 100: loss = 0.45
(Very slow improvement)
Good (0.001):
Epoch 1: loss = 0.5
Epoch 2: loss = 0.3
Epoch 5: loss = 0.1
(Steady improvement)
Too High (0.1):
Epoch 1: loss = 0.5
Epoch 2: loss = 1.2
Epoch 3: loss = 3.5
(Diverges! Gets worse)
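These three regimes are easy to reproduce on a toy problem. The sketch below (plain Python with a made-up loss L(w) = (w − 1)², not IOValence code) runs five gradient steps at each rate:

```python
def run(lr, steps=5, w=0.0):
    """Gradient descent on the toy loss L(w) = (w - 1)^2, starting at w = 0."""
    losses = []
    for _ in range(steps):
        losses.append((w - 1) ** 2)
        w -= lr * 2 * (w - 1)   # gradient of (w - 1)^2 is 2(w - 1)
    return losses

print([round(l, 3) for l in run(0.00001)])  # too low: loss barely moves
print([round(l, 3) for l in run(0.3)])      # good: loss shrinks steadily
print([round(l, 3) for l in run(1.5)])      # too high: loss grows -- diverges
```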
How to choose:
```yaml
training:
  optimizer: adam
  learning_rate: 0.001  # Good starting point
  # Or use learning rate scheduling
  scheduler: cosine_annealing
```
Learning Rate Schedule:
Large rate early (fast learning)
↓
Decrease over time (fine-tuning)
↓
Small final rate (stable)
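One common way to implement that shape is cosine annealing, which glides from a large initial rate to a small final one. A self-contained sketch (the rates and epoch count are example values, not IOValence defaults):

```python
import math

initial_lr = 0.001    # large early: fast learning
final_lr = 0.00001    # small late: stable fine-tuning
total_epochs = 30

def cosine_annealed_lr(epoch):
    """Cosine annealing from initial_lr down to final_lr over total_epochs."""
    progress = epoch / total_epochs
    return final_lr + 0.5 * (initial_lr - final_lr) * (1 + math.cos(math.pi * progress))

print(round(cosine_annealed_lr(0), 6))   # → 0.001 (start)
print(round(cosine_annealed_lr(15), 6))  # → 0.000505 (halfway)
print(round(cosine_annealed_lr(30), 6))  # → 1e-05 (end)
```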
Validation & Test Sets¶
Data Split¶
Total Data (100%)
│
├─ Training Set (70%) ← Model learns from this
├─ Validation Set (15%) ← Check during training
└─ Test Set (15%) ← Final evaluation
Why Split?¶
Without split:
Model remembers training data
Training accuracy: 99%
Real accuracy: 60% ❌
With split:
Model learns patterns
Training accuracy: 90%
Test accuracy: 87% ✅ Generalizes well
How to Split¶
```yaml
data:
  train_file: train.csv
  validation_split: 0.2  # 20% for validation
  test_size: 0.2         # 20% for testing
  # Results in: 60% train, 20% val, 20% test
```
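The split itself is just a shuffle followed by slicing. A framework-agnostic sketch (the function name and fractions are illustrative, not IOValence API):

```python
import random

def split_data(examples, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle, then slice into train / validation / test sets.
    Whatever remains after train and validation (here 20%) is the test set."""
    rng = random.Random(seed)      # fixed seed → reproducible split
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_data(range(100))
print(len(train), len(val), len(test))  # → 60 20 20
```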
Monitoring Training¶
Key Metrics to Watch¶
Training Loss: Should decrease
Training Accuracy: Should increase
Validation Loss: Should decrease (but smoother than training)
Validation Accuracy: Should increase
Good pattern:
Loss: ↘ ↘ ↘ (decreasing)
Accuracy: ↗ ↗ ↗ (increasing)
Bad pattern - Overfitting:
Training Loss: ↘ (decreasing)
Validation Loss: ↗ (increasing!) ❌
Bad pattern - Underfitting:
Both losses: ↘ but plateau too high ❌
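The overfitting pattern is exactly what early stopping watches for: validation loss that stops improving. A minimal sketch of the patience logic (plain Python; `patience` and `min_delta` mirror the config keys used elsewhere on this page, but this is not IOValence's implementation):

```python
def should_stop(val_losses, patience=5, min_delta=0.001):
    """True once validation loss has failed to improve by at least
    min_delta for `patience` consecutive epochs."""
    best = float("inf")
    bad_epochs = 0
    for loss in val_losses:
        if loss < best - min_delta:   # meaningful improvement
            best = loss
            bad_epochs = 0
        else:                         # plateau, or getting worse
            bad_epochs += 1
            if bad_epochs >= patience:
                return True
    return False

healthy = [0.9, 0.7, 0.5, 0.4, 0.35, 0.30]               # still improving
overfit = [0.9, 0.5, 0.4, 0.41, 0.42, 0.45, 0.50, 0.55]  # rising again
print(should_stop(healthy, patience=3))  # → False
print(should_stop(overfit, patience=3))  # → True
```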
Training Curves¶
Normal Training:
Accuracy
| ___
| /
| /
|__/____________
├─────────────── Epoch
Overfitting:
Accuracy
| /╲
| / ╲___
| / Val
|/Train
├─────────────── Epoch
Underfitting:
Accuracy
| ___________
| /
|/
├─────────────── Epoch
Learning Rate Strategies¶
Fixed Learning Rate¶
```yaml
learning_rate: 0.001
# Same throughout training
# Simple but suboptimal
```
Decreasing Learning Rate¶
```yaml
learning_rate: 0.001
scheduler: step
step_size: 10
gamma: 0.1
# Divide by 10 every 10 epochs
```
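The resulting schedule is a simple function of the epoch number. Sketched in Python with the same values as the config above:

```python
initial_lr = 0.001
step_size = 10   # epochs between drops
gamma = 0.1      # factor applied at each drop

def step_lr(epoch):
    """Step schedule: the rate drops by a factor of 10 every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)

for epoch in (0, 9, 10, 19, 20):
    print(epoch, step_lr(epoch))  # 0.001 through epoch 9, then 0.0001, then 1e-05
```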
Adaptive Learning Rate (Adam)¶
```yaml
optimizer: adam
learning_rate: 0.001
# Automatically adjusts per parameter
# Recommended for most cases
```
Optimization Algorithms¶
SGD (Stochastic Gradient Descent)¶
✓ Simple and reliable
✗ Can get stuck in local minima
✗ Slow convergence
Use: When simplicity and well-understood behavior matter
Momentum¶
✓ Faster convergence than SGD
✓ Better handling of local minima
✗ More complex
Use: Most traditional ML tasks
Adam (Adaptive Moment Estimation)¶
✓ Fast convergence
✓ Handles different learning rates per parameter
✓ Works well for deep learning
✗ More memory usage
Use: Recommended for IOValence (default)
RMSprop¶
✓ Good for RNNs
✓ Memory efficient
✗ More hyperparameters to tune
Use: When training RNNs
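To make the differences concrete, the update rules for SGD with momentum and for Adam can be run side by side on a one-parameter toy loss, L(w) = (w − 3)². This is a plain-Python illustration with made-up hyperparameters, not IOValence's optimizer code:

```python
import math

def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2 * (w - 3)

# SGD with momentum: a velocity term accumulates past gradients,
# which helps the update roll through shallow local minima.
w, velocity = 0.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(300):
    velocity = beta * velocity + grad(w)
    w -= lr * velocity
sgd_result = w

# Adam: keeps per-parameter estimates of the gradient mean (m) and
# uncentered variance (v), giving each parameter its own effective rate.
w, m, v = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 301):
    g = grad(w)
    m = b1 * m + (1 - b1) * g        # first-moment estimate
    v = b2 * v + (1 - b2) * g * g    # second-moment estimate
    m_hat = m / (1 - b1 ** t)        # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
adam_result = w

print(round(sgd_result, 2), round(adam_result, 2))  # both approach 3.0
```

Both optimizers drive the weight toward the minimum at 3; the interesting difference is *how* the step size behaves along the way, which is what the trade-off lists above describe.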
Training Configuration Template¶
```yaml
training:
  # Model architecture
  model_type: neural_network
  layers: [128, 64, 32]
  activation: relu

  # Training parameters
  epochs: 30
  batch_size: 32
  validation_split: 0.2

  # Optimization
  optimizer: adam
  learning_rate: 0.001
  scheduler: cosine_annealing

  # Regularization (prevent overfitting)
  dropout: 0.3
  l2_regularization: 0.0001

  # Early stopping
  early_stopping: true
  patience: 5
  min_delta: 0.001

  # Checkpointing
  save_checkpoint: true
  checkpoint_interval: 5

  # Debugging
  verbose: true
  seed: 42  # Reproducible results
```
Common Training Scenarios¶
Scenario 1: Quick Prototype¶
```yaml
training:
  epochs: 5
  batch_size: 64
  learning_rate: 0.01
  # Fast but lower accuracy
```
Scenario 2: Balanced¶
```yaml
training:
  epochs: 20
  batch_size: 32
  learning_rate: 0.001
  early_stopping: true
  # Good balance
```
Scenario 3: Production Ready¶
```yaml
training:
  epochs: 100
  batch_size: 16
  learning_rate: 0.001
  scheduler: cosine_annealing
  early_stopping: true
  patience: 10
  dropout: 0.3
  # Best quality, slower
```
Scenario 4: Limited Data¶
```yaml
training:
  epochs: 50
  batch_size: 16        # Smaller batches with small data
  learning_rate: 0.0001 # Very small
  dropout: 0.5          # Heavy regularization
  l2_regularization: 0.001
  # Prevent overfitting with little data
```
Troubleshooting Training¶
Model not learning (flat loss)¶
```yaml
# Increase learning rate
learning_rate: 0.01

# Try a more complex model
layers: [256, 128, 64]

# Check data quality
# (Is the data labeled correctly?)
```
Training too slow¶
```yaml
# Increase batch size
batch_size: 128

# Use fewer layers
layers: [64, 32]

# Use a GPU
# (pip install iovalence[gpu])
```
Overfitting (train good, test bad)¶
```yaml
# More regularization
dropout: 0.5
l2_regularization: 0.001

# More data
# (Collect more examples)

# Simpler model
layers: [32, 16]

# Early stopping
early_stopping: true
patience: 3
```
Training Best Practices¶
- Always use a validation set
  - Monitor validation metrics
  - Stop if they are not improving
- Start simple
  - A simple model gives a good baseline
  - Add complexity only if needed
- Check your data
  - Remove duplicates
  - Fix labels
  - Ensure class balance
- Use reproducible settings
  - `seed: 42` fixes the random seed
- Save checkpoints
  - `save_checkpoint: true`
- Document your settings
  - Include why you chose each value, e.g. `comments: "Testing learning rate 0.01 to improve convergence"`
Next Steps¶
- 🔍 Evaluate Performance - Measure results
- 📊 Prepare Data - Better data = better results
- 🤖 Create Agent - Put it to use
Ready to train your first model? Get Started →