Guide: Web crawling and sitemaps
X07 supports deterministic crawler logic by splitting pure scheduling and parsing code from OS-bound fetching:
- `ext-web-crawl` (pinned by `arch/crawl/`) provides:
  - robots.txt parsing (`std.crawl.robots`) per RFC 9309
  - sitemap parsing (`std.crawl.sitemap`)
  - URL canonicalization (`std.crawl.urlnorm`)
  - deterministic crawl scheduling (`std.crawl.schedule`)
- Fetch is OS-bound, but RR-friendly:
  - `std.crawl.fetch.replay_rr_v1(...)` (pure replay from an rr cassette entry)
  - `std.crawl.fetch.os.run_rr_v1(...)` / `std.crawl.fetch.os.run_rr_missing_v1(...)` (OS wrappers for record/replay)
Canonical workflow
- Keep crawl planning and parsing in pure modules (`solve-pure`).
- Record real fetch results once under an rr policy like `crawl_rr_v1` and keep the cassette under `.x07_rr/` (see the layout sketch after this list).
- Replay fetch results deterministically in `solve-rr` using `std.crawl.fetch.replay_rr_v1`.
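For orientation, here is a minimal sketch of what a repository might contain after one recording run. The policy name and directory paths come from this guide; the listing itself is illustrative, not a prescribed layout:

```
# Illustrative only; exact contents of the cassette directory are an assumption.
#   arch/crawl/   pinned crawler contracts (validated by `x07 arch check`)
#   .x07_rr/      recorded fetch cassettes, e.g. one for the crawl_rr_v1 policy
ls arch/crawl/ .x07_rr/
```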
`x07 arch check` validates pinned contracts under `arch/crawl/**` and enforces the world split (no OS fetch imports in solve worlds).
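As a routine step, the check can be run from the repository root before any solve run (no arguments beyond the subcommand are assumed here):

```
# Validate pinned contracts under arch/crawl/** and the pure/OS world split.
x07 arch check
```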
Adding the crawler package
Use the capability map (`web.crawl`) and sync the lockfile:

```
# Pick NAME@VERSION from /agent/latest/catalog/capabilities.json.
x07 pkg add NAME@VERSION --sync
```
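For example, adding the crawler package named earlier in this guide; the version shown is purely illustrative, so substitute whatever the catalog actually lists:

```
# Hypothetical invocation; replace 0.1.0 with the version from capabilities.json.
x07 pkg add ext-web-crawl@0.1.0 --sync
```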
Expert (operations)
- Treat robots rules as a crawler behavior contract, not access control; apply your own auth and allowlists separately.
- Prefer `run-os-sandboxed` for fetch recording runs; keep cassettes deterministic and review sanitizers for any sensitive headers.