Python Machine Learning Agent Rules
Project Context
You are building a machine learning project with Python 3.12+, using scikit-learn and/or PyTorch for modeling, pandas/polars for data processing, and MLflow or Weights & Biases for experiment tracking. Reproducibility and clean separation between data, features, and modeling code are primary concerns.
Code Style & Structure
- Use full type hints on all function signatures. Document tensor/array shapes in docstrings or inline comments: `# shape: (batch, seq_len, hidden_size)`.
- Follow PEP 8. Format with `ruff format`. Lint with `ruff check --select ALL`.
- Avoid magic numbers. Define all hyperparameters in a config dataclass or YAML file. Never hardcode learning rates, batch sizes, or architecture parameters in model files.
- Use `dataclasses` or Pydantic v2 `BaseModel` for configuration objects. Load configs at runtime.
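A minimal sketch of the config-object pattern above, using a frozen dataclass loaded from YAML (the field names and `TrainConfig`/`load_config` names are illustrative, not a project contract):

```python
# Hypothetical experiment config: a frozen dataclass loaded from YAML
# so no hyperparameter is hardcoded in model files.
from dataclasses import dataclass
from pathlib import Path

import yaml  # PyYAML


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float
    batch_size: int
    hidden_size: int
    max_epochs: int = 100


def load_config(path: Path) -> TrainConfig:
    """Load a TrainConfig from YAML; unknown keys raise TypeError."""
    raw = yaml.safe_load(path.read_text())
    return TrainConfig(**raw)
```

Because the dataclass is frozen, a loaded config cannot be mutated mid-run, which keeps logged hyperparameters honest.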
Project Structure
```
data/
  raw/                   # Original, immutable input data — never modified
  processed/             # Cleaned, feature-engineered, ready-for-model data
src/
  data/
    loaders.py           # Dataset loading, splitting (train/val/test)
    transforms.py        # Feature engineering, preprocessing pipelines
    validation.py        # Schema validation with pandera
  models/
    architectures.py     # Model class definitions
    training.py          # Training loop, optimizer setup, scheduling
    evaluation.py        # Metrics, confusion matrix, threshold analysis
  features/              # Feature selection, importance analysis
  utils/
    reproducibility.py   # Seed setting, deterministic flag configuration
    logging.py           # Structured logging, MLflow run context
configs/                 # YAML experiment configs per model variant
notebooks/               # Exploration only — no production code
tests/
models/                  # Saved checkpoints, ONNX exports
```
Data Processing
- Validate data at every pipeline entry point: check dtypes, null percentages, value ranges, and row counts.
- Use `pandera` or `great_expectations` schemas to assert data contracts. Fail fast on schema drift.
- Write idempotent pipelines — running the same pipeline twice on the same input must produce identical output.
- Log data statistics at each transformation step: shape, null counts, min/max, mean/std.
- Store intermediate results as Parquet files. Parquet preserves dtypes and is typically 10–20x faster to read than CSV.
- Use `sklearn.pipeline.Pipeline` to encapsulate preprocessing + model into a single estimator. This prevents data leakage during cross-validation because preprocessing steps are re-fit on each training fold rather than on the full dataset.
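The Pipeline rule above can be sketched as follows (synthetic data stands in for a real dataset; the step names are arbitrary):

```python
# Preprocessing + model as one sklearn Pipeline: the scaler's statistics
# are re-fit on each training fold inside cross_val_score, so validation
# rows never leak into the fit.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),           # fit on training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(n_splits=5))
```

Fitting the scaler outside the pipeline (on all of `X`) would leak validation-fold statistics into training, which this structure makes impossible by construction.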
Model Development
- Implement the `fit/predict` contract for custom estimators. Subclass `sklearn.base.BaseEstimator` and `TransformerMixin` or `ClassifierMixin`.
- Keep model architectures (PyTorch `nn.Module`) separate from training loops. Training logic lives in `training.py`.
- Implement early stopping based on validation loss with a patience counter. Never train for a fixed number of epochs without a stopping criterion.
- Use cross-validation (`StratifiedKFold` for classification, `KFold` for regression) for model selection. Use the held-out test set only for final evaluation — never for model selection.
- Save checkpoints including model state, optimizer state, epoch number, and best validation metric.
- Use `torch.inference_mode()` (not `no_grad()`) for evaluation and inference. It disables gradient tracking plus autograd's view and version-counter bookkeeping, so it is both stricter and faster.
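The bullets above can be combined into one minimal training-loop sketch (regression loss, `AdamW`, and the checkpoint path are illustrative assumptions): early stopping on validation loss with a patience counter, checkpoints bundling model/optimizer/epoch/best metric, and evaluation under `torch.inference_mode()`.

```python
import torch
from torch import nn


def evaluate(model: nn.Module, loader) -> float:
    """Mean per-element validation loss under inference_mode."""
    model.eval()
    total, n = 0.0, 0
    with torch.inference_mode():  # no grad tracking, no autograd bookkeeping
        for x, y in loader:
            total += nn.functional.mse_loss(model(x), y, reduction="sum").item()
            n += y.numel()
    return total / n


def train(model, train_loader, val_loader, *, lr=1e-3, max_epochs=100, patience=5):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            nn.functional.mse_loss(model(x), y).backward()
            opt.step()
        val_loss = evaluate(model, val_loader)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            # Checkpoint includes model state, optimizer state, epoch, best metric.
            torch.save({"model": model.state_dict(),
                        "optimizer": opt.state_dict(),
                        "epoch": epoch,
                        "best_val_loss": best_val}, "best.pt")
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # early stopping
                break
    return best_val
```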
Experiment Tracking
- Track every experiment run. Never run untracked experiments — use `mlflow.autolog()` or explicit `mlflow.log_params/metrics/artifacts`.
- Log: all hyperparameters, dataset version (hash or DVC tag), git commit hash, per-epoch metrics, and final model artifact.
- Use descriptive run names: `f"resnet50-lr{lr}-bs{batch_size}-{datetime.now():%Y%m%d}"`.
- Register production-ready models in the MLflow Model Registry. Use stage transitions (Staging → Production) or, on MLflow 2.9+, model version aliases (stages are deprecated there).
- Store confusion matrices, ROC curves, and feature importance plots as run artifacts.
Reproducibility
- Set seeds for all random sources: `random.seed(42)`, `np.random.seed(42)`, `torch.manual_seed(42)`, `torch.cuda.manual_seed_all(42)`.
- Set `torch.backends.cudnn.deterministic = True`, `torch.backends.cudnn.benchmark = False`, and call `torch.use_deterministic_algorithms(True)` when exact reproducibility is required (costs roughly 10% throughput).
- Pin all dependency versions in `pyproject.toml` with exact versions. Use `uv lock` or `pip-compile` for lockfiles.
- Use DVC to version datasets. Tag dataset versions in experiment logs.
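The seeding rules above fit naturally into one helper (a common convention; the `seed_everything` name is an assumption, not a library function):

```python
# Seed every random source in one place; deterministic=True applies the
# cuDNN flags at the cost of some throughput.
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 42, deterministic: bool = False) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op without CUDA
    os.environ["PYTHONHASHSEED"] = str(seed)
    if deterministic:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
```

Call it once at the top of every entry point (training script, evaluation script, notebook) and log the seed with the run.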
Testing
- Unit-test data transformations with `pytest`. Assert output shape, dtype, and no-null guarantees.
- Smoke-test the full training pipeline: train for a few steps on synthetic data and verify the loss is finite and decreases.
- Assert model output shapes and dtypes for every model variant.
- Test that serialization round-trips are correct: save → load → predict → same results.
- Run tests in CI on every commit with `pytest --tb=short`.
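The testing bullets above can be sketched as pytest-style functions, using a stand-in `nn.Linear` model and synthetic tensors (any real model variant would be parametrized in):

```python
import io

import torch
from torch import nn


def test_output_shape_and_dtype():
    model = nn.Linear(10, 3)
    out = model(torch.randn(8, 10))
    assert out.shape == (8, 3)
    assert out.dtype == torch.float32


def test_training_smoke_loss_decreases():
    # Two full-batch steps on synthetic data; loss must drop and stay finite.
    torch.manual_seed(0)
    model = nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    losses = []
    for _ in range(2):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    assert losses[-1] < losses[0]
    assert all(l == l for l in losses)  # no NaNs


def test_serialization_round_trip():
    # save -> load -> predict -> identical results
    model = nn.Linear(10, 3)
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    buf.seek(0)
    clone = nn.Linear(10, 3)
    clone.load_state_dict(torch.load(buf))
    x = torch.randn(4, 10)
    assert torch.equal(model(x), clone(x))
```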
Performance
- Mitigate data-loading bottlenecks for GPU training with `torch.utils.data.DataLoader(num_workers=4, prefetch_factor=2, pin_memory=True)` — but confirm the bottleneck with a profiler first.
- Use `torch.compile(model)` (PyTorch 2.0+) for ~20–30% training speedup on modern hardware.
- Use mixed precision training with `torch.autocast` and `torch.amp.GradScaler` (the older `torch.cuda.amp` spellings are deprecated) to reduce memory usage and speed up GPU compute.
- Profile with `torch.profiler` to identify kernel bottlenecks. Use `torch.utils.bottleneck` for quick high-level profiling.