
© 2026 Aurora Algorithm Inc.

Python + Scrapy

Rules for Scrapy web scraping projects covering spider design, item pipelines, middleware, rate limiting, data cleaning, and ethical scraping practices.

Tags: python, scrapy, web-scraping, data-extraction, spiders

Details

Language: Python
Framework: Scrapy

Rules Content (AGENTS.md)

Python + Scrapy Agent Rules

Project Context

You are building web scrapers with the Scrapy framework. Scrapy's asynchronous Twisted engine crawls multiple pages concurrently. The design priorities are: polite crawling, clean data extraction through item loaders, and robust error handling for production-grade reliability.

Code Style & Structure

- Follow PEP 8. Use `ruff` for linting and formatting. Add type hints to all spider attributes and method signatures.
- Document every spider class with a docstring: target site, data extracted, expected output format, and known limitations.
- Define URL constants and CSS/XPath selectors as class-level attributes, not inline strings.
- Use `logging.getLogger(__name__)` for all logging. Never use bare `print()` statements.

Project Structure

```
project/
  spiders/
    products.py      # scrapy.Spider or CrawlSpider subclasses
    listings.py
  items.py           # Item dataclasses or scrapy.Item subclasses
  loaders.py         # ItemLoader subclasses with field processors
  pipelines.py       # Validation, dedup, storage, export pipelines
  middlewares.py     # Downloader and Spider middleware
  settings/
    base.py
    development.py
    production.py
  tests/
    fixtures/        # Saved HTML response files for unit tests
    test_spiders.py
    test_pipelines.py
```

Spider Patterns

- Inherit from `scrapy.Spider` for targeted scraping. Use `CrawlSpider` with `Rule` + `LinkExtractor` for recursive link following.
- Use `start_requests()` instead of `start_urls` when URLs need dynamic construction or authentication headers.
- Attach an `errback` to every `yield Request(url, callback=self.parse, errback=self.handle_error)` call. Log and handle failures explicitly.
- Never store mutable per-request state on `self` — concurrent requests share the spider instance. Pass context with `cb_kwargs` or `meta`.
- Set `custom_settings` on individual spiders to override concurrency and delay for that domain only.
- Implement pagination by yielding the next-page Request from the parse callback, not via a loop.

Item Loaders

- Use `ItemLoader` with field-specific `input_processor` and `output_processor` for every field.
- Apply `MapCompose(str.strip, remove_tags)` as the default `input_processor` for text fields.
- Use `TakeFirst()` as the default `output_processor` for single-value fields.
- Normalize at the loader level: dates to ISO 8601, prices to `Decimal`, relative URLs to absolute with `urljoin`.
- Validate required fields in a validation pipeline, not inside the spider's parse method.

Pipelines

- Order pipelines explicitly by their `ITEM_PIPELINES` priority value: validation (100), dedup (200), cleaning (300), storage (400), export (500).
- Implement a validation pipeline that raises `DropItem(f'Missing {field}')` for items lacking required fields.
- Implement a deduplication pipeline using a `set()` of seen identifiers or a Bloom filter for large crawls.
- Implement `open_spider(self, spider)` and `close_spider(self, spider)` for resource management (database connections, file handles).
- Use Scrapy's `ImagesPipeline` or `FilesPipeline` for media downloads. Set `FILES_STORE` to a cloud bucket path.

Middleware Configuration

- Enable `AutoThrottle` in production: `AUTOTHROTTLE_ENABLED = True`, `AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0`, `AUTOTHROTTLE_MAX_DELAY = 30`.
- Set `DOWNLOAD_DELAY = 1.0` as a floor. AutoThrottle adjusts upward based on server response times.
- Rotate User-Agent strings with `scrapy-fake-useragent` or a custom `RandomUserAgentMiddleware`.
- Use `RetryMiddleware` with `RETRY_TIMES = 3` and add `503`, `429` to `RETRY_HTTP_CODES`.
- Keep `ROBOTSTXT_OBEY = True` by default. Override only on explicitly approved targets.
- Set a descriptive `USER_AGENT` that includes project name and contact email.
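
Collected into a settings sketch (values taken from the bullets above; the contact details and custom middleware module path are placeholders):

```python
# settings/production.py (sketch; tune values per target)
BOT_NAME = "project"
USER_AGENT = "project-crawler/1.0 (+https://example.com; contact@example.com)"
ROBOTSTXT_OBEY = True

DOWNLOAD_DELAY = 1.0                 # floor; AutoThrottle adjusts upward
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 4

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30

RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

DOWNLOAD_TIMEOUT = 30

DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path for a custom User-Agent rotator.
    "project.middlewares.RandomUserAgentMiddleware": 400,
}
```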

Error Handling

- Handle Twisted errors in errbacks: `from twisted.internet.error import TimeoutError, ConnectionRefusedError`.
- Log failed request URLs with `spider.name`, HTTP status, and error type for post-crawl review.
- Use spider signals `spider_error` and `item_error` for centralized error aggregation.
- Log crawl stats at the end of every run via `spider_closed` signal. Alert if `item_scraped_count == 0`.
- Set `DOWNLOAD_TIMEOUT = 30` to prevent indefinite hangs on unresponsive servers.

Rate Limiting & Politeness

- Set `CONCURRENT_REQUESTS_PER_DOMAIN = 4` as the default. Lower for sensitive targets.
- Use `DOWNLOAD_DELAY` combined with `RANDOMIZE_DOWNLOAD_DELAY = True` to add jitter.
- Respect `Retry-After` headers from rate-limit responses (429) in a custom retry middleware.
- Never scrape user-generated content at rates that could disrupt the target service.
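
Honoring `Retry-After` means parsing both forms the header allows, delta-seconds and an HTTP-date. A stdlib-only helper one could call from a custom retry middleware, as a sketch:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After header value into a delay in seconds.

    Accepts delta-seconds ("120") or an HTTP-date. Returns a
    non-negative float, or None if the value is unparseable.
    """
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())
```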

Testing

- Write spider unit tests against saved fixtures: `scrapy.http.HtmlResponse(url, body=Path('fixtures/page.html').read_bytes(), encoding='utf-8')`.
- Test item loaders independently: instantiate the loader with a mock response, call `load_item()`, assert field values.
- Test pipelines by calling `pipeline.process_item(item, mock_spider)` directly. Test `DropItem` paths.
- Use `betamax` or `vcrpy` to record and replay real HTTP interactions for integration tests.
- Validate scraped output against a Pydantic schema in CI to catch selector breakage early.

Related Templates

- Python + FastAPI: High-performance Python API development with FastAPI, Pydantic, and async patterns.
- Python + Django: Django web development with class-based views, ORM best practices, and DRF.
- Python + Flask: Lightweight Python web development with Flask and extensions.