Prepare Training Data¶
Format and validate your data for training.
Data Format¶
CSV Format (Recommended)¶
feature1,feature2,feature3,label
5.1,3.5,1.4,setosa
7.0,3.2,4.7,versicolor
6.3,3.3,6.0,virginica
JSON Format¶
[
  {
    "feature1": 5.1,
    "feature2": 3.5,
    "feature3": 1.4,
    "label": "setosa"
  }
]
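If your data arrives as JSON but you would rather train from the recommended CSV layout, the conversion is short. The following is a minimal sketch using only the Python standard library. It assumes the JSON file is an array of flat objects that share the same keys, as in the example above. The function name json_records_to_csv is illustrative and not part of any tool described here.

```python
import csv
import io
import json


def json_records_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text with a header row."""
    records = json.loads(json_text)
    if not isinstance(records, list) or not records:
        raise ValueError("expected a non-empty JSON array of objects")

    # Use the keys of the first record as the column order.
    fieldnames = list(records[0].keys())
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # DictWriter raises ValueError if a later record has a key
    # that is not in the first record.
    writer.writerows(records)
    return buffer.getvalue()


sample = '[{"feature1": 5.1, "feature2": 3.5, "feature3": 1.4, "label": "setosa"}]'
print(json_records_to_csv(sample))
```

Records that lack one of the keys are written with an empty field, which would then show up as a missing value during validation.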
Excel Format¶
Supported via pandas conversion.
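A minimal sketch of that conversion, assuming you run it yourself on a single worksheet whose first row holds the column names. pd.read_excel needs an Excel engine such as openpyxl installed to read .xlsx files. The function name and the file paths are placeholders.

```python
import pandas as pd


def excel_to_csv(xlsx_path, csv_path, sheet_name=0):
    """Read one worksheet and write it out as UTF-8 CSV."""
    # sheet_name=0 selects the first worksheet; its first row becomes the header.
    df = pd.read_excel(xlsx_path, sheet_name=sheet_name)
    df.to_csv(csv_path, index=False, encoding="utf-8")
    return df


# Example with placeholder paths:
# excel_to_csv("data/raw/original.xlsx", "data/raw/original.csv")
```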
Data Requirements¶
- Minimum size: 100 examples
- Format: CSV, JSON, or Excel
- Headers: First row must be column names
- Consistency: All rows have the same number of columns
- Labels: Every row has a target value (no empty labels)
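The requirements above can be checked mechanically before training. The sketch below uses only the standard library and works on CSV text. The function name check_requirements and the default label column name label are assumptions taken from the format examples. It can only confirm that the header row has no empty names, not that the first row really is a header.

```python
import csv
import io

MIN_EXAMPLES = 100  # minimum size from the requirements list


def check_requirements(csv_text, label_col="label", min_examples=MIN_EXAMPLES):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]

    header, data = rows[0], rows[1:]
    if any(not name.strip() for name in header):
        problems.append("header row contains an empty column name")
    if len(data) < min_examples:
        problems.append(f"only {len(data)} examples, need at least {min_examples}")

    # Every data row must have as many fields as the header (header is row 1).
    for row_no, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(f"row {row_no}: expected {len(header)} columns, got {len(row)}")

    if label_col not in header:
        problems.append(f"label column '{label_col}' not found")
    else:
        idx = header.index(label_col)
        missing = sum(1 for row in data if len(row) > idx and not row[idx].strip())
        if missing:
            problems.append(f"{missing} rows have an empty label")
    return problems


text = "feature1,feature2,label\n5.1,3.5,setosa\n7.0,3.2\n"
print(check_requirements(text, min_examples=2))
```

Returning a list of problems rather than raising on the first one lets you fix everything in a single pass.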
Data Organization¶
my-project/
├── data/
│   ├── train.csv    (70%)
│   ├── test.csv     (30%)
│   └── raw/
│       └── original.csv
└── config.yaml
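One way to produce the 70/30 split shown in the layout above is to shuffle the raw file with pandas and write the two parts. This is a sketch under assumptions: the helper name split_train_test is illustrative, and the DataFrame is assumed to have a unique index (the default after pd.read_csv).

```python
import pandas as pd


def split_train_test(df, train_frac=0.7, seed=42):
    """Randomly split the rows into train and test DataFrames."""
    # sample() draws train_frac of the rows at random; a fixed seed makes it repeatable.
    train = df.sample(frac=train_frac, random_state=seed)
    # Everything not drawn for training goes to the test set (requires a unique index).
    test = df.drop(train.index)
    return train, test


# Example with the paths from the layout above:
# df = pd.read_csv("data/raw/original.csv")
# train, test = split_train_test(df)
# train.to_csv("data/train.csv", index=False)
# test.to_csv("data/test.csv", index=False)
```

A purely random split can leave a rare class under-represented in one of the files. If that matters, scikit-learn's train_test_split with its stratify argument is a common alternative.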
Validation Checklist¶
- [ ] No missing values
- [ ] Correct column names in config
- [ ] Balanced classes
- [ ] No duplicate rows
- [ ] Proper encoding (UTF-8)
- [ ] Valid file format
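Several items on the checklist above can be measured directly from the data. The sketch below reports them with pandas. The names checklist_report and is_utf8 are illustrative, the label column name label is assumed from the format examples, and what counts as "balanced" is left to you because this guide does not state a threshold.

```python
import pandas as pd


def checklist_report(df, label_col="label"):
    """Summarize the checklist items that can be measured from the data itself."""
    counts = df[label_col].value_counts()
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "class_counts": counts.to_dict(),
        # Ratio of largest to smallest class; 1.0 means perfectly balanced.
        "imbalance_ratio": float(counts.max() / counts.min()),
    }


def is_utf8(path):
    """Return True if the file at `path` decodes cleanly as UTF-8."""
    try:
        with open(path, encoding="utf-8") as handle:
            handle.read()
        return True
    except UnicodeDecodeError:
        return False


# Example with a placeholder path:
# print(is_utf8("data/train.csv"))
# print(checklist_report(pd.read_csv("data/train.csv")))
```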
Common Data Issues & Solutions¶
Missing Values¶
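import pandas as pd
# Load the data and drop every row that has at least one missing value
df = pd.read_csv('data.csv')
df = df.dropna()
# Save the cleaned data without writing the index as an extra column
df.to_csv('data_clean.csv', index=False)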
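Dropping rows is simple, but it can push a small dataset below the 100-example minimum. One alternative, offered here as an assumption rather than something this guide prescribes, is to drop only the rows whose label is missing and fill the remaining numeric gaps with the column median. The helper name fill_numeric_gaps is illustrative.

```python
import pandas as pd


def fill_numeric_gaps(df, label_col="label"):
    """Drop rows without a label, then fill numeric NaNs with column medians."""
    # A row without a target value cannot be used for training, so remove it.
    out = df.dropna(subset=[label_col]).copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # fillna with a Series fills each column using the value keyed by its name.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out
```

Medians are computed after the unlabeled rows are removed. To avoid leaking information, compute them on the training split only and reuse those values for the test split.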
Imbalanced Classes¶
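import pandas as pd
from sklearn.utils import resample
# Oversample the minority class; df is the DataFrame loaded earlier.
# Replace 'majority' and 'minority' with your own label values.
df_majority = df[df['label'] == 'majority']
df_minority = df[df['label'] == 'minority']
df_minority = resample(df_minority,
                       replace=True,                # sample with replacement
                       n_samples=len(df_majority),  # match the majority class size
                       random_state=42)             # make the result reproducible
df_balanced = pd.concat([df_majority, df_minority])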
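The snippet above covers two classes. For data with more classes, such as the three species in the CSV example, the same idea can be applied per class. The following is a sketch with pandas only. The helper name oversample_to_largest is illustrative, and it keeps every original row and adds random copies until each class matches the largest one.

```python
import pandas as pd


def oversample_to_largest(df, label_col="label", seed=42):
    """Add random copies of rows from smaller classes until all classes match the largest."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(group)  # keep all original rows
        deficit = target - len(group)
        if deficit > 0:
            # Sample with replacement so a class can grow beyond its original size.
            parts.append(group.sample(n=deficit, replace=True, random_state=seed))
    return pd.concat(parts, ignore_index=True)
```

Oversampling is usually applied to the training split only. Copies of the same row landing in both train and test files would inflate evaluation scores.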
Duplicates¶
# Remove exact duplicate rows, keeping the first occurrence of each
df = df.drop_duplicates()
df.to_csv('data_unique.csv', index=False)
Next Steps¶
- 🚀 Create Agent
- 🧠 Train Model
- 📈 Evaluate
For more detail on data preparation, see the Data Preparation Guide.