
Data Preparation Guide

Quality data is the foundation of good agents. Learn how to prepare it properly.

Why Data Matters

Good Data (Quality, Clean)    →  Good Model (Accurate Predictions)
Bad Data (Messy, Imbalanced)  →  Poor Model (Unreliable)

A common rule of thumb: data preparation takes roughly 80% of a project's effort, modeling only ~20% - and data quality typically matters far more than model complexity.

Data Quality Checklist

  • [ ] Complete - No missing values (or handled)
  • [ ] Accurate - Correct labels and values
  • [ ] Balanced - Similar examples per class
  • [ ] Relevant - Related to your task
  • [ ] Sufficient - Enough examples (100+)
  • [ ] Diverse - Various patterns/scenarios
  • [ ] Clean - No errors, duplicates, or noise
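Parts of this checklist can be automated. A minimal sketch (the `quality_report` helper and its column names are illustrative, not from a specific library):

```python
import pandas as pd

def quality_report(df: pd.DataFrame, label_col: str = "label") -> dict:
    """Quick programmatic pass over the checklist above."""
    report = {
        "rows": len(df),
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }
    if label_col in df.columns:
        report["class_counts"] = df[label_col].value_counts().to_dict()
    return report
```

Running it on a raw DataFrame gives a quick read on completeness, duplicates, and class balance before any cleaning starts.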

Step 1: Data Cleaning

Identify Missing Values

import pandas as pd

df = pd.read_csv('data.csv')

# Find missing values
print(df.isnull().sum())

# Share of missing values per column (as a percentage)
print(df.isnull().mean() * 100)

Handle Missing Data

Option 1: Remove rows

# Remove rows with ANY missing values
df_clean = df.dropna()

# Remove rows missing in specific column
df_clean = df.dropna(subset=['important_column'])

Option 2: Fill missing values

# Fill with mean (numeric)
df['age'] = df['age'].fillna(df['age'].mean())

# Fill with constant (any type)
df['category'] = df['category'].fillna('unknown')

# Forward fill (time series)
df = df.ffill()

Remove Duplicates

# Find duplicates
print(df[df.duplicated()])

# Remove duplicates
df_clean = df.drop_duplicates()

# Remove duplicates based on specific columns
df_clean = df.drop_duplicates(subset=['text', 'label'])

Fix Data Types

# Convert to correct type (note: astype(int) raises if NaN remains - handle missing values first)
df['age'] = df['age'].astype(int)
df['price'] = df['price'].astype(float)
df['date'] = pd.to_datetime(df['date'])

# Check data types
print(df.dtypes)

Step 2: Text Preprocessing (For Text Data)

Lowercase Conversion

df['text'] = df['text'].str.lower()

Remove Special Characters

import re

def clean_text(text):
    # Remove special chars, keep alphanumeric and spaces
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

df['text'] = df['text'].apply(clean_text)

Tokenization

# Simple space-based tokenization
df['tokens'] = df['text'].str.split()

# Or use NLTK's tokenizer (requires a one-time download: nltk.download('punkt'))
from nltk.tokenize import word_tokenize
df['tokens'] = df['text'].apply(word_tokenize)

Remove Stopwords

from nltk.corpus import stopwords

# Requires a one-time download: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [t for t in tokens if t not in stop_words]

df['tokens'] = df['tokens'].apply(remove_stopwords)

Stemming/Lemmatization

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    return [stemmer.stem(t) for t in tokens]

df['tokens'] = df['tokens'].apply(stem_tokens)

Step 3: Feature Scaling (For Numeric Data)

Normalization (0-1)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])

# Result: values between 0 and 1

Standardization (Mean=0, Std=1)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])

# Result: centered around 0
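One caveat for both scalers: fitting on the full dataset leaks test-set statistics into training. The usual pattern is to fit on the training split only and reuse those statistics on the test split. A minimal sketch (the toy array is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics
```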

Step 4: Handle Imbalanced Data

Check Balance

# Count samples per class
print(df['label'].value_counts())

# Percentage
print(df['label'].value_counts(normalize=True))

# Example output (imbalanced):
# positive: 800 (80%)
# negative: 200 (20%)  ← Imbalanced!

Resampling Techniques

Oversampling (Duplicate minority class)

from sklearn.utils import resample

# Separate by class
df_majority = df[df['label'] == 'positive']
df_minority = df[df['label'] == 'negative']

# Oversample minority
df_minority_oversampled = resample(
    df_minority,
    replace=True,
    n_samples=len(df_majority),
    random_state=42
)

# Combine
df_balanced = pd.concat([df_majority, df_minority_oversampled])

Undersampling (Reduce majority class)

# Undersample majority
df_majority_undersampled = resample(
    df_majority,
    replace=False,
    n_samples=len(df_minority),
    random_state=42
)

# Combine
df_balanced = pd.concat([df_majority_undersampled, df_minority])

SMOTE (Synthetic Minority Oversampling)

from imblearn.over_sampling import SMOTE

# SMOTE interpolates between nearest neighbors, so features must be numeric
X = df[['feature1', 'feature2', ...]]
y = df['label']

smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)

Step 5: Encoding Categorical Variables

One-Hot Encoding

# For non-ordinal categories (color, category)
df = pd.get_dummies(df, columns=['color'], prefix='color')

# Before:  color: [red, blue, green]
# After:   color_red: [1, 0, 0]
#          color_blue: [0, 1, 0]
#          color_green: [0, 0, 1]

Label Encoding

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])

# Before: [positive, negative, neutral]
# After:  [2, 0, 1]  (classes are numbered in alphabetical order)

Target Encoding

# Encode based on target mean
target_encoding = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(target_encoding)
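Target encoding computed on the full dataset leaks label information into the features. A common safeguard, sketched below with illustrative toy data, is to fit the mapping on training rows only and fall back to the global mean for unseen categories:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "a", "b"],
    "target":   [1,   0,   1,   1,   1,   0],
})

train = df.iloc[:4]
test = df.iloc[4:]

# Fit the encoding on training rows only
mapping = train.groupby("category")["target"].mean()
global_mean = train["target"].mean()

# Unseen categories in test fall back to the global mean
test_encoded = test["category"].map(mapping).fillna(global_mean)
```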

Step 6: Feature Engineering

Creating New Features

# From existing columns
df['name_length'] = df['name'].str.len()
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 65, 100])
df['is_expensive'] = (df['price'] > 100).astype(int)

# Time-based features
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

Feature Selection

from sklearn.feature_selection import SelectKBest, f_classif

# Select top 10 features
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Get feature names
selected_features = X.columns[selector.get_support()]
print(selected_features)

Step 7: Data Validation

Check Data Quality

# Summary statistics
print(df.describe())

# Data types
print(df.dtypes)

# Missing values
print(df.isnull().sum())

# Duplicates
print(df.duplicated().sum())

# Distribution
print(df['label'].value_counts())

Visualize Distribution

import matplotlib.pyplot as plt

# Class distribution
df['label'].value_counts().plot(kind='bar')
plt.title('Class Distribution')
plt.show()

# Feature distribution
df['age'].hist(bins=30)
plt.title('Age Distribution')
plt.show()

# Correlation heatmap (numeric columns only)
import seaborn as sns
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()

Data Preparation Workflow

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# 1. Load data
df = pd.read_csv('raw_data.csv')

# 2. Clean
df = df.dropna()
df = df.drop_duplicates()

# 3. Preprocess text (if applicable)
df['text'] = df['text'].str.lower()

# 4. Handle imbalance (oversample the minority class)
df_majority = df[df['label'] == 'positive']
df_minority = df[df['label'] == 'negative']
df_minority = resample(df_minority, replace=True,
                       n_samples=len(df_majority), random_state=42)
df = pd.concat([df_majority, df_minority])

# 5. Scale numeric features
# (in production, fit the scaler on the training split only to avoid leakage)
scaler = StandardScaler()
numeric_cols = ['age', 'income']
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# 6. Encode categorical
df = pd.get_dummies(df, columns=['category'])

# 7. Split data (fixed seed for reproducibility)
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# 8. Save
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)
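The `df.sample(frac=0.8)` split above ignores class balance. scikit-learn's `train_test_split` with `stratify` keeps class proportions in both splits; a sketch with illustrative toy data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "feature": range(20),
    "label": ["pos"] * 16 + ["neg"] * 4,
})

# stratify preserves the 80/20 class ratio in both splits
train, test = train_test_split(
    df, test_size=0.25, stratify=df["label"], random_state=42
)
```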

Common Data Issues

| Issue               | Impact              | Solution      |
| ------------------- | ------------------- | ------------- |
| Missing values      | Errors              | Fill or drop  |
| Duplicates          | Biased training     | Remove        |
| Imbalanced classes  | Poor minority class | Resample      |
| Wrong labels        | Wrong learning      | Manual review |
| Outliers            | Bad model           | Remove or cap |
| Mixed scales        | Unfair weights      | Normalize     |
| Correlated features | Redundancy          | Remove one    |
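For the "remove or cap" option on outliers, a common convention clips values beyond 1.5×IQR from the quartiles (the multiplier is a rule of thumb, not a fixed standard):

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

s = pd.Series([1, 2, 3, 4, 100])
capped = cap_outliers_iqr(s)  # the 100 is pulled down to the upper fence
```

Capping keeps the row (and its other features) while limiting the outlier's influence; dropping the row loses that information.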

Data Privacy & Ethics

  • Remove PII: Names, emails, phone numbers
  • Anonymize: Replace identifiers
  • Aggregate: Group sensitive data
  • Consent: Ensure data usage permission
  • Bias: Check for fairness across groups

# Remove sensitive columns
df = df.drop(columns=['name', 'email', 'phone'])

# Anonymize IDs
df['customer_id'] = range(len(df))

Best Practices Summary

  1. Understand your data first - explore, visualize, summarize
  2. Clean thoroughly - missing values, duplicates, errors
  3. Check balance - ensure fair class representation
  4. Normalize/scale - bring features to comparable ranges
  5. Validate quality - use data quality tools and checks
  6. Document steps - others should reproduce your prep
  7. Split properly - train/validation/test with no overlap

Next Steps


Data ready? Train Your Agent →