Data Preparation Guide¶
Quality data is the foundation of good agents. Learn how to prepare it properly.
Why Data Matters¶
Good data (quality, clean) → accurate predictions. Bad data (messy, imbalanced) → an unreliable model.
A common rule of thumb is that data preparation takes roughly 80% of a project's effort while modeling takes the remaining ~20% — and data quality typically matters far more than model complexity.
Data Quality Checklist¶
- [ ] Complete - No missing values (or handled)
- [ ] Accurate - Correct labels and values
- [ ] Balanced - Similar examples per class
- [ ] Relevant - Related to your task
- [ ] Sufficient - Enough examples (100+)
- [ ] Diverse - Various patterns/scenarios
- [ ] Clean - No errors, duplicates, or noise
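Several of these checks can be automated. Below is a minimal sketch with pandas; the check_quality helper and the 100-example threshold are illustrative choices, not part of any standard library:
import pandas as pd

def check_quality(df: pd.DataFrame, label_col: str = 'label') -> None:
    """Quick pass over the checklist items pandas can verify."""
    print("Missing values per column:")
    print(df.isnull().sum())
    print(f"Duplicate rows: {df.duplicated().sum()}")
    if label_col in df.columns:
        counts = df[label_col].value_counts()
        print("Class balance:")
        print(counts / counts.sum())
    print(f"Sufficient examples (100+): {len(df) >= 100}")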
Step 1: Data Cleaning¶
Identify Missing Values¶
import pandas as pd
df = pd.read_csv('data.csv')
# Find missing values
print(df.isnull().sum())
# Percentage of missing values per column
print(df.isnull().mean() * 100)
Handle Missing Data¶
Option 1: Remove rows
# Remove rows with ANY missing values
df_clean = df.dropna()
# Remove rows missing in specific column
df_clean = df.dropna(subset=['important_column'])
Option 2: Fill missing values
# Fill with mean (numeric)
df['age'] = df['age'].fillna(df['age'].mean())
# Fill with constant (any type)
df['category'] = df['category'].fillna('unknown')
# Forward fill (time series); fillna(method='ffill') is deprecated
df = df.ffill()
Remove Duplicates¶
# Find duplicates
print(df[df.duplicated()])
# Remove duplicates
df_clean = df.drop_duplicates()
# Remove duplicates based on specific columns
df_clean = df.drop_duplicates(subset=['text', 'label'])
Fix Data Types¶
# Convert to correct type (int conversion requires no missing values)
df['age'] = df['age'].astype(int)
df['price'] = df['price'].astype(float)
df['date'] = pd.to_datetime(df['date'])
# Check data types
print(df.dtypes)
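If a numeric column contains stray strings, astype will raise an error. A minimal sketch using pd.to_numeric with errors='coerce', which converts what it can and marks the rest as missing:
# Coerce unparseable entries (e.g. 'N/A', '$12') to NaN, then handle them
df['price'] = pd.to_numeric(df['price'], errors='coerce')
print(df['price'].isnull().sum())  # how many entries failed to parse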
Step 2: Text Preprocessing (For Text Data)¶
Lowercase Conversion¶
df['text'] = df['text'].str.lower()
Remove Special Characters¶
import re
def clean_text(text):
    # Remove special chars, keep alphanumeric and spaces
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text
df['text'] = df['text'].apply(clean_text)
Tokenization¶
# Simple space-based tokenization
df['tokens'] = df['text'].str.split()
# Or use NLTK's tokenizer (requires: nltk.download('punkt'))
from nltk.tokenize import word_tokenize
df['tokens'] = df['text'].apply(word_tokenize)
Remove Stopwords¶
from nltk.corpus import stopwords
# Requires: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
def remove_stopwords(tokens):
    return [t for t in tokens if t not in stop_words]
df['tokens'] = df['tokens'].apply(remove_stopwords)
Stemming/Lemmatization¶
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
def stem_tokens(tokens):
    return [stemmer.stem(t) for t in tokens]
df['tokens'] = df['tokens'].apply(stem_tokens)
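For lemmatization, which maps words to dictionary forms rather than chopped stems, one option is NLTK's WordNetLemmatizer (requires the wordnet corpus); a minimal sketch:
from nltk.stem import WordNetLemmatizer
# Requires: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(t) for t in tokens]
df['tokens'] = df['tokens'].apply(lemmatize_tokens)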
Step 3: Feature Scaling (For Numeric Data)¶
Normalization (0-1)¶
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
# Result: values between 0 and 1
Standardization (Mean=0, Std=1)¶
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
# Result: centered around 0
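One caution not shown above: fitting a scaler on the full dataset leaks test-set statistics into training. A minimal sketch of the leak-free pattern, assuming the data will be split into train and test frames:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, test_df = train_df.copy(), test_df.copy()
scaler = StandardScaler()
cols = ['age', 'income']
# Fit on the training split only, then apply the same transform to test
train_df[cols] = scaler.fit_transform(train_df[cols])
test_df[cols] = scaler.transform(test_df[cols])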
Step 4: Handle Imbalanced Data¶
Check Balance¶
# Count samples per class
print(df['label'].value_counts())
# Percentage
print(df['label'].value_counts(normalize=True))
# Example output (imbalanced):
# positive: 800 (80%)
# negative: 200 (20%) ← Imbalanced!
Resampling Techniques¶
Oversampling (Duplicate minority class)
from sklearn.utils import resample
# Separate by class
df_majority = df[df['label'] == 'positive']
df_minority = df[df['label'] == 'negative']
# Oversample minority
df_minority_oversampled = resample(
    df_minority,
    replace=True,
    n_samples=len(df_majority),
    random_state=42
)
# Combine
df_balanced = pd.concat([df_majority, df_minority_oversampled])
Undersampling (Reduce majority class)
# Undersample majority
df_majority_undersampled = resample(
    df_majority,
    replace=False,
    n_samples=len(df_minority),
    random_state=42
)
# Combine
df_balanced = pd.concat([df_majority_undersampled, df_minority])
SMOTE (Synthetic Minority Oversampling)
from imblearn.over_sampling import SMOTE
X = df[['feature1', 'feature2', ...]]
y = df['label']
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)
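Resampling is not the only option: many scikit-learn estimators can weight classes instead, which avoids duplicating or discarding rows. A minimal sketch, assuming the X and y from the SMOTE example above (LogisticRegression is just an illustrative model choice):
from sklearn.linear_model import LogisticRegression
# class_weight='balanced' reweights classes inversely to their frequency
clf = LogisticRegression(class_weight='balanced')
clf.fit(X, y)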
Step 5: Encoding Categorical Variables¶
One-Hot Encoding¶
# For non-ordinal categories (color, category); dtype=int gives 0/1 columns
df = pd.get_dummies(df, columns=['color'], prefix='color', dtype=int)
# Before: color: [red, blue, green]
# After: color_red: [1, 0, 0]
# color_blue: [0, 1, 0]
# color_green: [0, 0, 1]
Label Encoding¶
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])
# Before: [positive, negative, neutral]
# After:  [2, 0, 1]  (classes are numbered in sorted order: negative=0, neutral=1, positive=2)
Target Encoding¶
Target encoding replaces each category with the mean of the target within it. Compute the means on training data only, or label information leaks into validation and test; see the sketch below.
# Encode based on target mean (computed here on the full frame for illustration)
target_encoding = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(target_encoding)
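A minimal leak-free sketch, assuming train_df and test_df already exist: compute the category means on the training split and fall back to the global training mean for unseen categories:
# Means computed on training data only
means = train_df.groupby('category')['target'].mean()
global_mean = train_df['target'].mean()
train_df['category_encoded'] = train_df['category'].map(means)
# Unseen categories in test fall back to the global training mean
test_df['category_encoded'] = test_df['category'].map(means).fillna(global_mean)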
Step 6: Feature Engineering¶
Creating New Features¶
# From existing columns
df['name_length'] = df['name'].str.len()
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 65, 100])
df['is_expensive'] = (df['price'] > 100).astype(int)
# Time-based features
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
Feature Selection¶
from sklearn.feature_selection import SelectKBest, f_classif
# Select top 10 features
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# Get feature names
selected_features = X.columns[selector.get_support()]
print(selected_features)
Step 7: Data Validation¶
Check Data Quality¶
# Summary statistics
print(df.describe())
# Data types
print(df.dtypes)
# Missing values
print(df.isnull().sum())
# Duplicates
print(df.duplicated().sum())
# Distribution
print(df['label'].value_counts())
Visualize Distribution¶
import matplotlib.pyplot as plt
# Class distribution
df['label'].value_counts().plot(kind='bar')
plt.title('Class Distribution')
plt.show()
# Feature distribution
df['age'].hist(bins=30)
plt.title('Age Distribution')
plt.show()
# Correlation heatmap (numeric columns only)
import seaborn as sns
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
Data Preparation Workflow¶
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
# 1. Load data
df = pd.read_csv('raw_data.csv')
# 2. Clean
df = df.dropna()
df = df.drop_duplicates()
# 3. Preprocess text (if applicable)
df['text'] = df['text'].str.lower()
# 4. Handle imbalance (in production, resample the training split only)
df_majority = df[df['label'] == 'positive']
df_minority = df[df['label'] == 'negative']
df_minority = resample(df_minority, replace=True,
                       n_samples=len(df_majority), random_state=42)
df = pd.concat([df_majority, df_minority])
# 5. Scale numeric features (in production, fit the scaler on the training split only)
scaler = StandardScaler()
numeric_cols = ['age', 'income']
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
# 6. Encode categorical
df = pd.get_dummies(df, columns=['category'])
# 7. Split data
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
# 8. Save
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)
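The random sample(frac=0.8) split above can skew class proportions on small datasets. A minimal sketch using scikit-learn's stratified splitting instead:
from sklearn.model_selection import train_test_split
# stratify keeps the label distribution identical in train and test
train, test = train_test_split(df, test_size=0.2,
                               stratify=df['label'], random_state=42)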
Common Data Issues¶
| Issue | Impact | Solution |
|---|---|---|
| Missing values | Errors | Fill or drop |
| Duplicates | Biased training | Remove |
| Imbalanced | Poor minority class | Resample |
| Wrong labels | Wrong learning | Manual review |
| Outliers | Bad model | Remove or cap |
| Mixed scales | Unfair weights | Normalize |
| Correlated features | Redundancy | Remove one |
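For the outliers row above, a minimal capping (winsorizing) sketch using pandas quantiles; the 1%/99% thresholds are an illustrative choice:
# Cap values below the 1st and above the 99th percentile
low, high = df['price'].quantile([0.01, 0.99])
df['price'] = df['price'].clip(lower=low, upper=high)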
Data Privacy & Ethics¶
- Remove PII: Names, emails, phone numbers
- Anonymize: Replace identifiers
- Aggregate: Group sensitive data
- Consent: Ensure data usage permission
- Bias: Check for fairness across groups
# Remove sensitive columns
df = df.drop(columns=['name', 'email', 'phone'])
# Anonymize IDs
df['customer_id'] = range(len(df))
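If records must stay joinable across tables, sequential renumbering breaks the link. A minimal sketch of a salted-hash pseudonym instead (the salt value is illustrative and must be kept secret for the scheme to hold):
import hashlib

SALT = 'replace-with-a-secret-salt'  # illustrative; store securely, never commit
def pseudonymize(value) -> str:
    return hashlib.sha256((SALT + str(value)).encode()).hexdigest()[:16]

df['customer_id'] = df['customer_id'].apply(pseudonymize)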
Best Practices Summary¶
- Understand your data first - Explore, visualize, summarize
- Clean thoroughly - Missing values, duplicates, errors
- Check balance - Ensure fair class representation
- Normalize/scale - Bring features to comparable ranges
- Validate quality - Use data quality tools and checks
- Document steps - Others should reproduce your prep
- Split properly - Train/validation/test with no overlap
Next Steps¶
- 🚀 Create Your Agent - Use prepared data
- 🧠 Training Basics - Train with good data
- 📊 Evaluate Performance - Measure results
Data ready? Train Your Agent →