API Reference
REST endpoints for crawling, agent runs, vision fallback, and live streaming. All request and response bodies are JSON.
Same image runs at https://grub.nuts.services (managed, bearer token required) and http://localhost:6792 (self-hosted Docker). Examples below use the managed URL — swap as needed.
Authentication
Pass an ahp_ token from auth.nuts.services in the Authorization header:
curl -H "Authorization: Bearer ahp_yourtoken" https://grub.nuts.services/api/crawl
Self-hosted with DISABLE_AUTH=true skips token validation — pass customer_id per-request to scope storage. See Authentication for full details.
Core crawling
| Endpoint | Purpose |
|---|---|
| POST /api/crawl | Crawl a URL → HTML + markdown + screenshot + metadata |
| POST /api/markdown | Markdown only — faster path for RAG pipelines; accepts urls[] for multi |
| POST /api/batch | Batch crawl up to 50 URLs with bounded concurrency |
| POST /api/raw | Raw rendered HTML — no markdown conversion |
| GET /view?url=... | Browser-rendered HTML viewer (debugging) |
| GET /download?url=... | Download a file through the crawler's browser context |
POST /api/crawl
The primary endpoint. Crawls a single URL and returns rendered HTML, clean markdown, content quality classification, block detection, and optional screenshots.
curl -s -X POST https://grub.nuts.services/api/crawl \
-H "Authorization: Bearer ahp_yourtoken" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"options": {
"javascript": true,
"screenshot": false,
"timeout": 30,
"wait_until": "domcontentloaded",
"wait_after_load_ms": 1000,
"warmup": false,
"cookies": { "session_token": "value" },
"proxy": {
"server": "http://proxy:10001",
"username": "user",
"password": "pass"
}
}
}'
Key options:
- javascript: enable JS rendering (default true)
- screenshot: capture a full-page PNG
- wait_until: domcontentloaded | networkidle | selector
- wait_for_selector: CSS selector to wait for
- warmup: visit the homepage first (helps with Cloudflare)
- cookies: pre-solved challenge cookies
- proxy: per-request proxy override
- javascript_payload: custom JS to run before capture
Response includes success, status_code, markdown, markdown_plain, blocked, block_reason, content_quality (sufficient | minimal | empty | blocked), quarantined (true if hidden prompt injection detected), timings_ms, and screenshot_url.
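The response fields above drive the usual client-side triage. A minimal sketch, assuming only the documented fields (success, blocked, quarantined, content_quality); the action labels and the helper name next_action are illustrative, not part of the API:

```python
def next_action(resp: dict) -> str:
    """Decide what to do with a /api/crawl response body."""
    if resp.get("quarantined"):
        return "discard"        # fail-closed: never summarize quarantined text
    if resp.get("blocked") or resp.get("content_quality") == "blocked":
        return "ghost"          # escalate to POST /api/agent/ghost
    if resp.get("content_quality") == "empty":
        return "retry_js"       # re-crawl with javascript: true / longer waits
    if resp.get("content_quality") == "minimal":
        return "retry_or_ghost"
    return "use"                # sufficient: safe for RAG / summarization

# Example: a blocked response should escalate to Ghost Protocol
print(next_action({"success": True, "blocked": True, "content_quality": "blocked"}))
```

Checking quarantined first keeps the fail-closed guarantee even when other fields look usable.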
POST /api/markdown
Markdown-only extraction. Takes the same options as /api/crawl but is faster because it skips HTML serialization. Accepts url (single) or urls[] (up to 50); multi-URL requests return an array in input order.
curl -s -X POST https://grub.nuts.services/api/markdown \
-H "Authorization: Bearer ahp_yourtoken" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com", "https://httpbin.org/html"]}'
POST /api/batch
Batch crawl with bounded concurrency. concurrent caps parallel requests (default 3, range 1-10). Returns all results inline plus a summary.
curl -s -X POST https://grub.nuts.services/api/batch \
-H "Authorization: Bearer ahp_yourtoken" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://quotes.toscrape.com/page/1/",
"https://quotes.toscrape.com/page/2/"
],
"concurrent": 3
}'
Agent + Ghost Protocol
| Endpoint | Purpose |
|---|---|
| POST /api/agent/run | Autonomous LLM agent loop — plans, calls tools, iterates |
| GET /api/agent/status/:run_id | Check run status or load the full trace |
| POST /api/agent/ghost | Screenshot + vision-AI extraction — bypasses DOM anti-bot |
POST /api/agent/run
Submit a natural-language task. The agent plans, calls tools (crawl, click, extract), observes results, and iterates until done or a stop condition fires. Requires AGENT_ENABLED=true and at least one LLM provider key (Anthropic, OpenAI, or Ollama).
curl -s -X POST https://grub.nuts.services/api/agent/run \
-H "Authorization: Bearer ahp_yourtoken" \
-H "Content-Type: application/json" \
-d '{
"task": "Find the pricing page on example.com and extract plan details",
"max_steps": 10,
"max_wall_time_ms": 90000,
"allowed_domains": ["example.com"]
}'
Bounds: max_steps (default 12, range 1-50), max_wall_time_ms (default 90000, range 5000-300000). allowed_domains / allowed_tools restrict what the agent can touch.
Stop reasons: completed, max_steps, max_wall_time, max_failures, no_op_loop, policy_denied.
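Runs are checked via GET /api/agent/status/:run_id. A polling sketch under stated assumptions: the "status" field and its "running" value are guesses about the response shape (the docs above only say the endpoint returns status or the full trace), and fetch_status stands in for whatever HTTP client you use:

```python
import time

def wait_for_run(fetch_status, run_id, poll_s=2.0, timeout_s=120.0):
    """Poll the agent status endpoint until the run leaves the assumed
    'running' state, then return the final body for stop-reason handling."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        body = fetch_status(run_id)
        if body.get("status") != "running":
            return body  # finished: inspect stop reason / trace
        time.sleep(poll_s)
    raise TimeoutError(f"agent run {run_id} still running after {timeout_s}s")

# Usage with a stubbed fetcher:
states = iter([{"status": "running"}, {"status": "completed", "stop_reason": "completed"}])
print(wait_for_run(lambda _rid: next(states), "run-123", poll_s=0.0))
```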
POST /api/agent/ghost
Vision fallback. When standard crawling returns blocked: true or content_quality: "empty", Ghost screenshots the page and extracts content via Claude / GPT-4o / Ollama vision — bypasses DOM-based anti-bot entirely.
curl -s -X POST https://grub.nuts.services/api/agent/ghost \
-H "Authorization: Bearer ahp_yourtoken" \
-H "Content-Type: application/json" \
-d '{
"url": "https://heavily-protected-site.com",
"prompt": "Extract the main article content"
}'
Cache
| Endpoint | Purpose |
|---|---|
| POST /api/cache/search | Fuzzy TF-IDF search across cached markdown |
| GET /api/cache/list | List cached document metadata; filter by domain or quality |
| POST /api/cache/upsert | Insert or update a cache entry (pre-populate, or store external content) |
curl -s -X POST https://grub.nuts.services/api/cache/search \
-H "Authorization: Bearer ahp_yourtoken" \
-H "Content-Type: application/json" \
-d '{
"query": "web crawling best practices",
"domain": "example.com",
"min_similarity": 0.4,
"max_results": 10
}'
Live stream
| Endpoint | Purpose |
|---|---|
| WS /stream/:session_id | Live viewport stream — receive frames, send navigate/click/scroll/type |
| GET /stream/:session_id/mjpeg | MJPEG fallback for plain <img> embedding |
Requires BROWSER_STREAM_ENABLED=true and BROWSER_POOL_SIZE >= 1 on the server.
const ws = new WebSocket("wss://grub.nuts.services/stream/my-session?url=https://example.com");
ws.onmessage = (e) => {
const msg = JSON.parse(e.data);
if (msg.type === "frame") {
document.getElementById("viewport").src = "data:image/jpeg;base64," + msg.data;
}
};
// Send control actions only after the socket is open (e.g. from ws.onopen):
ws.send(JSON.stringify({ action: "navigate", url: "https://example.com/pricing" }));
ws.send(JSON.stringify({ action: "click", selector: "#signup-btn" }));
ws.send(JSON.stringify({ action: "scroll", direction: "down" }));
ws.send(JSON.stringify({ action: "type", selector: "#search", text: "hello" }));
Mesh
| Endpoint | Purpose |
|---|---|
| GET /mesh/peers | Known peers, health, and load |
| GET /mesh/status | This node's status and peer count |
| POST /mesh/execute | Execute a tool on a remote node (1-hop max) |
System & MCP
| Endpoint | Purpose |
|---|---|
| GET /health | Service status, version, tool count, mesh info |
| GET /tools | All registered AHP tools |
| GET /mcp/ | MCP streamable-http endpoint (trailing slash required) |
MCP setup
Add as an MCP server in Claude Code:
claude mcp add grubcrawler https://grub.nuts.services/mcp/ \
--transport http \
--header "Authorization: Bearer ahp_yourtoken"
Exposes grub_crawl, grub_screenshot, grub_diagnose as MCP tools. Note the trailing slash on /mcp/ — without it FastAPI returns 404.
Content quality classification
Every crawl response includes a content_quality field:
| Quality | Criteria | Action |
|---|---|---|
| sufficient | ≥ 600 chars AND ≥ 120 words | Safe for RAG / summarization |
| minimal | < 600 chars OR < 120 words | Retry or use Ghost Protocol |
| empty | < 80 chars OR < 15 words | Page didn't render — retry with JS |
| blocked | Anti-bot signal detected | Ghost Protocol or add proxy/stealth |
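For pre-checks on your own side (e.g. before re-ingesting cached markdown), the character/word thresholds in the table can be re-implemented directly. A sketch of just the length-based tiers; blocked is a server-side signal and is not reproduced here, and the exact word-counting rule the server uses is an assumption:

```python
def classify(markdown: str) -> str:
    """Length-based quality tiers from the table above; empty is checked
    first because its thresholds are a subset of minimal's."""
    chars, words = len(markdown), len(markdown.split())
    if chars < 80 or words < 15:
        return "empty"
    if chars < 600 or words < 120:
        return "minimal"
    return "sufficient"

print(classify("word " * 200))  # sufficient: 1000 chars, 200 words
print(classify("too short"))    # empty
```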
Prompt injection defense: When quarantined: true, the extractor detected hidden instruction-like text (common in .sr-only abuse). Content is blanked fail-closed. Do not summarize quarantined content — hidden text may contain prompt-injection payloads targeting downstream LLMs.
Running with DISABLE_AUTH=true? Pass customer_id in every request body to scope storage. See Authentication for full token flows.