Web Scraper¶
Fetches and aggregates content from one or more web pages and optionally answers user-provided queries based on the scraped content. Supports limited recursive crawling with a configurable depth. Returns a readable Q&A summary string and a JSON-formatted string of scraped documents.

Usage¶
Use this node to gather information from specified URLs and, if desired, generate answers to specific questions using the scraped content. Typical workflows include research, summarization, and preparing references: provide URLs, set a crawl depth (0–3), optionally include queries, and consume the returned answers and raw documents.
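As an illustration, the three inputs might look like the following. The values are placeholders taken from the examples below; the snippet only shows the expected formats (newline-separated URLs and queries, an integer depth), not how your platform wires them into the node.

```python
# Illustrative input values only; field names mirror the Inputs table below.
urls = (
    "https://example.com\n"
    "https://docs.python.org"
)  # one starting URL per line

max_depth = 1  # 0 = scrape only the listed URLs; values up to 3 follow in-page links

queries = (
    "What are the key features?\n"
    "Provide a short summary of each page."
)  # one question per line; leave empty ("") to return documents only
```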
Inputs¶
| Field | Required | Type | Description | Example | 
|---|---|---|---|---|
| urls | True | STRING | One or multiple starting URLs to scrape. Separate multiple URLs with line breaks. At least one non-empty URL is required. | https://example.com<br>https://docs.python.org |
| max_depth | True | INT | Maximum recursive scraping depth. 0 scrapes only the provided URLs; higher values follow in-page links up to the given depth (max 3). Larger pages or deeper crawls may increase processing time. | 1 | 
| queries | True | STRING | Optional questions to answer using the scraped content. Separate multiple queries with line breaks. Can be left empty to return only the documents. | What are the key features?<br>Provide a short summary of each page. |
Outputs¶
| Field | Type | Description | Example | 
|---|---|---|---|
| answers | STRING | A formatted string containing each query and its corresponding answer derived from the scraped pages. If a query failed, the output includes a note indicating it was skipped due to an exception. | **Query** What are the key features?<br>**Answer** The site highlights A, B, and C as core features. |
| documents | STRING | A JSON-formatted string of the scraped documents. Each entry contains the page name and content; error entries include a note if a page was skipped due to an exception. | [ { "name": "Example Domain", "content": "This domain is for use in illustrative examples..." } ] | 
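Because `documents` is returned as a JSON string, downstream code typically parses it before use. A minimal sketch, assuming the string follows the shape shown above (a list of objects with `name` and `content` keys):

```python
import json

# Example value copied from the Outputs table; in practice this comes from the node's documents output.
documents_str = '[{"name": "Example Domain", "content": "This domain is for use in illustrative examples..."}]'

docs = json.loads(documents_str)  # -> list of dicts
for doc in docs:
    # Entries for skipped pages may carry an explanatory note instead of the page content.
    print(doc.get("name", "<unnamed>"))
    print(doc.get("content", "")[:200])  # preview the first 200 characters
```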
Important Notes¶
- Depth 0 scrapes only the provided URLs; increasing depth follows links and can significantly increase runtime (an illustrative depth-limited crawl is sketched after this list).
- Large or complex pages and higher depths may cause processing to approach the node timeout (300 seconds).
- At least one URL is required; queries can be left empty if you only need raw documents.
- Answers are produced only when queries are provided. If a query fails, the answer text will indicate it was skipped due to an exception.
- Documents output is a JSON string; you may need to parse it in downstream nodes to access individual entries.
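To make the depth semantics concrete, here is a minimal sketch of a depth-limited crawl. It is illustrative only, not the node's actual implementation, and assumes the `requests` and `beautifulsoup4` packages; it also ignores concerns the node may handle differently (robots.txt, timeouts, content extraction).

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(url: str, depth: int, seen: set | None = None) -> list[dict]:
    """Fetch `url`; if depth > 0, follow its in-page links one level down."""
    seen = seen if seen is not None else set()
    if url in seen:
        return []
    seen.add(url)

    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Record skipped pages instead of failing outright, similar to the node's error entries.
        return [{"name": url, "content": f"Skipped due to an exception: {exc}"}]

    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else url
    docs = [{"name": title, "content": soup.get_text(separator=" ", strip=True)}]

    if depth > 0:  # depth 0 stops at the provided URL
        for link in soup.find_all("a", href=True):
            docs.extend(crawl(urljoin(url, link["href"]), depth - 1, seen))
    return docs


# Depth 1: the page itself plus every page it links to.
documents = crawl("https://example.com", depth=1)
```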
Troubleshooting¶
- If you see 'Request to Agents service failed', verify network connectivity and that the target service is available, then retry.
- If the node times out, reduce max_depth, decrease the number of URLs, or choose smaller/lighter pages.
- If documents is empty or minimal, ensure the URLs are valid and publicly accessible (no authentication or heavy client-side rendering required); a quick reachability check is sketched after this list.
- If answers are empty, provide one or more queries or check the documents output to confirm scraping succeeded.
- If the documents string indicates pages were skipped due to exceptions, try the URLs individually, reduce depth, or remove problematic links.
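As a quick pre-check for the accessibility issues above, you can probe each URL before running the node. A minimal sketch assuming the `requests` package; a non-2xx status or an exception suggests the page may be skipped or yield little content.

```python
import requests

urls = ["https://example.com", "https://docs.python.org"]

for url in urls:
    try:
        # HEAD keeps the check lightweight; some servers reject it, so fall back to GET if needed.
        response = requests.head(url, timeout=10, allow_redirects=True)
        print(f"{url}: HTTP {response.status_code}")
    except requests.RequestException as exc:
        print(f"{url}: unreachable ({exc})")
```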
Example Pipelines¶
