Prepare Training Data

Format and validate your data for training.

Data Format

CSV Format

feature1,feature2,feature3,label
5.1,3.5,1.4,setosa
7.0,3.2,4.7,versicolor
6.3,3.3,6.0,virginica

JSON Format

[
  {
    "feature1": 5.1,
    "feature2": 3.5,
    "feature3": 1.4,
    "label": "setosa"
  }
]
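
A JSON file in this shape (an array of flat objects) maps onto the same table as the CSV example. The following is a minimal sketch of that mapping, assuming pandas is installed; the helper name `json_records_to_frame` and the output path in the comment are illustrative, not part of any tool described here.

```python
import json

import pandas as pd


def json_records_to_frame(json_text: str) -> pd.DataFrame:
    """Parse a JSON array of flat objects into a DataFrame, one row per object."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of objects")
    return pd.DataFrame.from_records(records)


sample = '[{"feature1": 5.1, "feature2": 3.5, "feature3": 1.4, "label": "setosa"}]'
frame = json_records_to_frame(sample)
# frame.to_csv("data.csv", index=False)  # hypothetical output path
```

Each object becomes one row, and the object keys become the column names.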

Excel Format

Excel files are supported by converting them with pandas, for example to CSV.
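
One way the pandas conversion might look is sketched below. It assumes the data sits in a single worksheet and that an Excel engine such as openpyxl is installed alongside pandas; the function name `excel_to_csv` and its parameters are hypothetical.

```python
import pandas as pd


def excel_to_csv(xlsx_path: str, csv_path: str, sheet_name=0) -> pd.DataFrame:
    """Read one worksheet and write it out as a UTF-8 CSV file."""
    # pd.read_excel needs an Excel engine (e.g. openpyxl) for .xlsx files.
    df = pd.read_excel(xlsx_path, sheet_name=sheet_name)
    # index=False keeps the pandas row index out of the CSV.
    df.to_csv(csv_path, index=False, encoding="utf-8")
    return df
```

The default `sheet_name=0` reads the first worksheet; pass a sheet name instead if the data lives elsewhere in the workbook.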

Data Requirements

  • Minimum size: 100 examples
  • Format: CSV, JSON, or Excel
  • Headers: The first row must contain the column names
  • Consistency: All rows must have the same number of columns
  • Labels: All target values present
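
Some of these requirements can be checked mechanically before training. The sketch below uses only the standard library and covers the minimum size, the header row, and the column-count consistency; it also flags empty label cells, which is one possible reading of the labels requirement. The function name and the default `label` column are assumptions taken from the sample data above.

```python
import csv
import io

MIN_ROWS = 100  # minimum size from the requirements list


def check_csv_requirements(text: str, label_column: str = "label") -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header, data = rows[0], rows[1:]
    problems = []
    if len(data) < MIN_ROWS:
        problems.append(f"only {len(data)} examples, need at least {MIN_ROWS}")
    label_idx = header.index(label_column) if label_column in header else None
    if label_idx is None:
        problems.append(f"header has no '{label_column}' column")
    # Row numbers count data rows only, starting at 1.
    for row_no, row in enumerate(data, start=1):
        if len(row) != len(header):
            problems.append(f"row {row_no}: {len(row)} columns, expected {len(header)}")
        elif label_idx is not None and not row[label_idx].strip():
            problems.append(f"row {row_no}: empty label")
    return problems
```

For a file on disk, read it with `open(path, encoding="utf-8", newline="").read()` and pass the text in.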

Data Organization

my-project/
├── data/
│   ├── train.csv       (70%)
│   ├── test.csv        (30%)
│   └── raw/
│       └── original.csv
└── config.yaml
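
The 70/30 split shown in the tree could be produced from the raw file as sketched below. This assumes a plain random split is acceptable and that the DataFrame has a unique index (the default after `pd.read_csv`); the helper name `split_train_test` and the seed value are illustrative.

```python
import pandas as pd


def split_train_test(df: pd.DataFrame, train_frac: float = 0.7, seed: int = 42):
    """Shuffle rows and split them into train and test frames."""
    # Assumes a unique index, so dropping by index removes exactly the train rows.
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test


# Usage with the layout above:
# df = pd.read_csv("data/raw/original.csv")
# train, test = split_train_test(df)
# train.to_csv("data/train.csv", index=False)
# test.to_csv("data/test.csv", index=False)
```

This split is not stratified, so small classes may end up unevenly represented. scikit-learn's `train_test_split` with its `stratify` argument is an alternative when class proportions matter.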

Validation Checklist

  • [ ] No missing values
  • [ ] Correct column names in config
  • [ ] Balanced classes
  • [ ] No duplicate rows
  • [ ] Proper encoding (UTF-8)
  • [ ] Valid file format
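
Several checklist items can be measured rather than eyeballed. The following sketch reports missing values, duplicate rows, and class counts for a loaded DataFrame, and separately tests whether a file decodes as UTF-8. The function names and the `label` column default are assumptions, and what counts as "balanced" is left to the reader.

```python
import pandas as pd


def checklist_report(df: pd.DataFrame, label_column: str = "label") -> dict:
    """Summarize the data-level checklist items for a loaded DataFrame."""
    counts = df[label_column].value_counts()
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "class_counts": counts.to_dict(),
        # Ratio of largest to smallest class; 1.0 means perfectly balanced.
        "imbalance_ratio": float(counts.max() / counts.min()),
    }


def is_utf8(path: str) -> bool:
    """Return True if the whole file decodes cleanly as UTF-8."""
    try:
        with open(path, encoding="utf-8") as handle:
            handle.read()
        return True
    except UnicodeDecodeError:
        return False
```

A report with zero missing values, zero duplicate rows, and an imbalance ratio close to 1.0 corresponds to the first, third, and fourth items being ticked.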

Common Data Issues & Solutions

Missing Values

import pandas as pd

# Load the data and drop every row that has at least one missing value
df = pd.read_csv('data.csv')
df = df.dropna()
df.to_csv('data_clean.csv', index=False)

Imbalanced Classes

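import pandas as pd
from sklearn.utils import resample

# 'majority' and 'minority' stand in for your actual class labels
df_majority = df[df['label'] == 'majority']
df_minority = df[df['label'] == 'minority']

# Oversample the minority class (with replacement) to match the majority size
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=len(df_majority),
                                 random_state=42)
df_balanced = pd.concat([df_majority, df_minority_upsampled])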

Duplicates

# Keep the first occurrence of each fully identical row
df = df.drop_duplicates()
df.to_csv('data_unique.csv', index=False)

Next Steps

For more detailed data preparation, see the Data Preparation Guide.