API Reference
REST endpoints for crawling, agent runs, vision fallback, and live streaming. All request and response bodies are JSON.
Same image runs at https://grub.nuts.services (managed, bearer token required) and http://localhost:6792 (self-hosted Docker). Examples below use the managed URL — swap as needed.
Authentication
Pass an ahp_ token from auth.nuts.services in the Authorization header:
curl -H "Authorization: Bearer ahp_yourtoken" https://grub.nuts.services/api/crawl
Self-hosted with DISABLE_AUTH=true skips token validation — pass customer_id per-request to scope storage. See Authentication for full details.
Core crawling
| Endpoint | Purpose |
|---|---|
| POST /api/crawl | Crawl a URL → HTML + markdown + screenshot + metadata |
| POST /api/markdown | Markdown only — faster path for RAG pipelines; accepts urls[] for multi |
| POST /api/batch | Batch crawl up to 50 URLs with bounded concurrency |
| POST /api/raw | Raw rendered HTML — no markdown conversion |
| GET /view?url=... | Browser-rendered HTML viewer (debugging) |
| GET /download?url=... | Download a file through the crawler's browser context |
POST /api/crawl
The primary endpoint. Crawls a single URL and returns rendered HTML, clean markdown, content quality classification, block detection, and optional screenshots.
curl -s -X POST https://grub.nuts.services/api/crawl \
-H "Authorization: Bearer ahp_yourtoken" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"options": {
"javascript": true,
"screenshot": false,
"timeout": 30,
"wait_until": "domcontentloaded",
"wait_after_load_ms": 1000,
"warmup": false,
"cookies": { "session_token": "value" },
"proxy": {
"server": "http://proxy:10001",
"username": "user",
"password": "pass"
}
}
}'
Key options:
- javascript: enable JS rendering (default true)
- screenshot: capture a full-page PNG
- wait_until: domcontentloaded | networkidle | selector
- wait_for_selector: CSS selector to wait for
- warmup: visit the homepage first (helps with Cloudflare)
- cookies: pre-solved challenge cookies
- proxy: per-request proxy override
- javascript_payload: custom JS to run before capture
Response includes success, status_code, markdown, markdown_plain, blocked, block_reason, content_quality (sufficient | minimal | empty | blocked), quarantined (true if hidden prompt injection detected), timings_ms, and screenshot_url.
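The response fields above drive the usual client-side triage. A minimal sketch, assuming only the documented fields (success, blocked, quarantined, content_quality); the action labels and the helper name next_action are illustrative, not part of the API:

```python
def next_action(resp: dict) -> str:
    """Decide what to do with a /api/crawl response body."""
    if resp.get("quarantined"):
        return "discard"        # fail-closed: never summarize quarantined text
    if resp.get("blocked") or resp.get("content_quality") == "blocked":
        return "ghost"          # escalate to POST /api/agent/ghost
    if resp.get("content_quality") == "empty":
        return "retry_js"       # re-crawl with javascript: true / longer waits
    if resp.get("content_quality") == "minimal":
        return "retry_or_ghost"
    return "use"                # sufficient: safe for RAG / summarization

# Example: a blocked response should escalate to Ghost Protocol
print(next_action({"success": True, "blocked": True, "content_quality": "blocked"}))
```

Checking quarantined first keeps the fail-closed guarantee even when other fields look usable.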
POST /api/markdown
Markdown-only extraction. Takes the same options as /api/crawl but is faster because it skips HTML serialization. Accepts url (single) or urls[] (up to 50); multi-URL requests return an array in input order.
curl -s -X POST https://grub.nuts.services/api/markdown \
-H "Authorization: Bearer ahp_yourtoken" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com", "https://httpbin.org/html"]}'
POST /api/batch
Batch crawl with bounded concurrency. concurrent caps parallel requests (default 3, range 1-10). Returns all results inline plus a summary.
curl -s -X POST https://grub.nuts.services/api/batch \
-H "Authorization: Bearer ahp_yourtoken" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://quotes.toscrape.com/page/1/",
"https://quotes.toscrape.com/page/2/"
],
"concurrent": 3
}'
Agent + Ghost Protocol
| Endpoint | Purpose |
|---|---|
| POST /api/agent/run | Autonomous LLM agent loop — plans, calls tools, iterates |
| GET /api/agent/status/:run_id | Check run status or load the full trace |
| POST /api/agent/ghost | Screenshot + vision-AI extraction — bypasses DOM anti-bot |
POST /api/agent/run
Submit a natural-language task. The agent plans, calls tools (crawl, click, extract), observes results, and iterates until done or a stop condition fires. Requires AGENT_ENABLED=true and at least one LLM provider key (Anthropic, OpenAI, or Ollama).
curl -s -X POST https://grub.nuts.services/api/agent/run \
-H "Authorization: Bearer ahp_yourtoken" \
-H "Content-Type: application/json" \
-d '{
"task": "Find the pricing page on example.com and extract plan details",
"max_steps": 10,
"max_wall_time_ms": 90000,
"allowed_domains": ["example.com"]
}'
Bounds: max_steps (default 12, range 1-50), max_wall_time_ms (default 90000, range 5000-300000). allowed_domains / allowed_tools restrict what the agent can touch.
Stop reasons: completed, max_steps, max_wall_time, max_failures, no_op_loop, policy_denied.
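Runs are checked via GET /api/agent/status/:run_id. A polling sketch under stated assumptions: the "status" field and its "running" value are guesses about the response shape (the docs above only say the endpoint returns status or the full trace), and fetch_status stands in for whatever HTTP client you use:

```python
import time

def wait_for_run(fetch_status, run_id, poll_s=2.0, timeout_s=120.0):
    """Poll the agent status endpoint until the run leaves the assumed
    'running' state, then return the final body for stop-reason handling."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        body = fetch_status(run_id)
        if body.get("status") != "running":
            return body  # finished: inspect stop reason / trace
        time.sleep(poll_s)
    raise TimeoutError(f"agent run {run_id} still running after {timeout_s}s")

# Usage with a stubbed fetcher:
states = iter([{"status": "running"}, {"status": "completed", "stop_reason": "completed"}])
print(wait_for_run(lambda _rid: next(states), "run-123", poll_s=0.0))
```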
POST /api/agent/ghost
Vision fallback. When standard crawling returns blocked: true or content_quality: "empty", Ghost screenshots the page and extracts content via Claude / GPT-4o / Ollama vision — bypasses DOM-based anti-bot entirely.
curl -s -X POST https://grub.nuts.services/api/agent/ghost \
-H "Authorization: Bearer ahp_yourtoken" \
-H "Content-Type: application/json" \
-d '{
"url": "https://heavily-protected-site.com",
"prompt": "Extract the main article content"
}'
Cache
| Endpoint | Purpose |
|---|---|
| POST /api/cache/search | Fuzzy TF-IDF search across cached markdown |
| GET /api/cache/list | List cached document metadata; filter by domain or quality |
| POST /api/cache/upsert | Insert or update a cache entry (pre-populate, or store external content) |
curl -s -X POST https://grub.nuts.services/api/cache/search \
-H "Authorization: Bearer ahp_yourtoken" \
-H "Content-Type: application/json" \
-d '{
"query": "web crawling best practices",
"domain": "example.com",
"min_similarity": 0.4,
"max_results": 10
}'
Live stream
| Endpoint | Purpose |
|---|---|
| WS /stream/:session_id | Live viewport stream — receive frames, send navigate/click/scroll/type |
| GET /stream/:session_id/mjpeg | MJPEG fallback for plain <img> embedding |
Requires BROWSER_STREAM_ENABLED=true and BROWSER_POOL_SIZE >= 1 on the server.
const ws = new WebSocket("wss://grub.nuts.services/stream/my-session?url=https://example.com");
ws.onmessage = (e) => {
const msg = JSON.parse(e.data);
if (msg.type === "frame") {
document.getElementById("viewport").src = "data:image/jpeg;base64," + msg.data;
}
};
// Send control actions only after the socket is open (e.g. from ws.onopen):
ws.send(JSON.stringify({ action: "navigate", url: "https://example.com/pricing" }));
ws.send(JSON.stringify({ action: "click", selector: "#signup-btn" }));
ws.send(JSON.stringify({ action: "scroll", direction: "down" }));
ws.send(JSON.stringify({ action: "type", selector: "#search", text: "hello" }));
Mesh
| Endpoint | Purpose |
|---|---|
| GET /mesh/peers | Known peers, health, and load |
| GET /mesh/status | This node's status and peer count |
| POST /mesh/execute | Execute a tool on a remote node (1-hop max) |
System & MCP
| Endpoint | Purpose |
|---|---|
| GET /health | Service status, version, tool count, mesh info |
| GET /tools | All registered AHP tools |
| GET /mcp/ | MCP streamable-http endpoint (trailing slash required) |
MCP setup
Add as an MCP server in Claude Code:
claude mcp add grubcrawler https://grub.nuts.services/mcp/ \
--transport http \
--header "Authorization: Bearer ahp_yourtoken"
Exposes grub_crawl, grub_screenshot, grub_diagnose as MCP tools. Note the trailing slash on /mcp/ — without it FastAPI returns 404.
Content quality classification
Every crawl response includes a content_quality field:
| Quality | Criteria | Action |
|---|---|---|
| sufficient | ≥ 600 chars AND ≥ 120 words | Safe for RAG / summarization |
| minimal | < 600 chars OR < 120 words | Retry or use Ghost Protocol |
| empty | < 80 chars OR < 15 words | Page didn't render — retry with JS |
| blocked | Anti-bot signal detected | Ghost Protocol or add proxy/stealth |
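For pre-checks on your own side (e.g. before re-ingesting cached markdown), the character/word thresholds in the table can be re-implemented directly. A sketch of just the length-based tiers; blocked is a server-side signal and is not reproduced here, and the exact word-counting rule the server uses is an assumption:

```python
def classify(markdown: str) -> str:
    """Length-based quality tiers from the table above; empty is checked
    first because its thresholds are a subset of minimal's."""
    chars, words = len(markdown), len(markdown.split())
    if chars < 80 or words < 15:
        return "empty"
    if chars < 600 or words < 120:
        return "minimal"
    return "sufficient"

print(classify("word " * 200))  # sufficient: 1000 chars, 200 words
print(classify("too short"))    # empty
```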
Prompt injection defense: When quarantined: true, the extractor detected hidden instruction-like text (common in .sr-only abuse). Content is blanked fail-closed. Do not summarize quarantined content — hidden text may contain prompt-injection payloads targeting downstream LLMs.
Running with DISABLE_AUTH=true? Pass customer_id in every request body to scope storage. See Authentication for full token flows.