Web Scraper

Fetches and parses web pages starting from one or more URLs and (optionally) answers user-provided queries based on the scraped content. It sends a request to the Agents service, handles errors, and returns a formatted Q&A summary plus the raw scraped documents as a JSON string. Depth-limited recursive scraping is supported up to a maximum depth of 3.

Usage

Use this node when you need to extract content from web pages and optionally get concise answers to specific questions about that content. Provide one or more URLs (newline-separated), choose a recursion depth for following links, and optionally add queries (newline-separated). The node returns a human-readable answers summary and a JSON string containing all scraped documents.

Inputs

| Field | Required | Type | Description | Example |
| --- | --- | --- | --- | --- |
| urls | True | STRING | One or multiple starting URLs to scrape. Separate multiple URLs with line breaks. At least one non-empty URL is required. | https://example.com<br>https://example.org/articles/123 |
| max_depth | True | INT | Maximum recursion depth for following links from the provided pages. At depth 0, only the provided URLs are scraped. Valid range: 0–3. | 1 |
| queries | True | STRING | Optional questions to answer based on the scraped content. Separate multiple queries with line breaks. Can be left empty to only retrieve documents. | What is the main topic of the page?<br>List the key takeaways. |
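The urls and queries fields take newline-separated values. Below is a minimal sketch of plausible input values; the variable names simply mirror the field names above and are not part of any API:

```python
# Hypothetical input values for this node; variable names mirror the fields above.
urls = "https://example.com\nhttps://example.org/articles/123"  # one URL per line
max_depth = 1   # 0 scrapes only the given URLs; the node caps this value at 3
queries = "What is the main topic of the page?\nList the key takeaways."  # may be empty
```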

Outputs

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| answers | STRING | A formatted text summary containing each query and its corresponding answer. If a query fails, the response includes a note indicating it was skipped due to an exception. | **Query** What is the main topic? **Answer** The page discusses example domain usage and purpose. |
| documents | STRING | A JSON-formatted string representing the list of scraped documents. Each document includes fields such as name and content. If a page failed to load, its content explains the exception. | [ { "name": "https://example.com", "content": "Example Domain ..." }, { "name": "Error", "content": "Page was skipped due to exception:\n" } ] |
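Because documents is a JSON string rather than a Python object, downstream code usually parses it first. A minimal sketch, assuming the document structure shown in the example above:

```python
import json

# `documents_output` stands in for the node's documents output (a JSON string).
documents_output = '[{"name": "https://example.com", "content": "Example Domain ..."}]'

docs = json.loads(documents_output)  # -> list of dicts with "name" and "content"
for doc in docs:
    print(doc["name"], "->", len(doc.get("content", "")), "characters")
```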

Important Notes

  • External service dependency: This node calls the Agents service endpoint at agents/web_scrape. Network connectivity and correct service configuration are required (see the request sketch after this list).
  • Timeout: Requests time out after 300 seconds. Large pages or high recursion depth can increase processing time.
  • Depth limit: max_depth is capped at 3. Depth 0 scrapes only the provided URLs.
  • Input formatting: Provide multiple URLs or queries as separate lines. At least one URL is required; queries can be empty.
  • Error propagation: Non-200 responses or JSON parsing issues will raise an exception. Failed pages or queries are returned with explanatory messages in the outputs.
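When debugging connectivity or configuration problems, it can help to call the endpoint directly. The sketch below is illustrative only: the base URL and the payload field names are assumptions, and only the agents/web_scrape path and the 300-second timeout come from the notes above.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: replace with your Agents service address

payload = {
    "urls": ["https://example.com"],                      # assumed field name
    "max_depth": 1,                                       # the node caps this at 3
    "queries": ["What is the main topic of the page?"],   # assumed field name
}

# Mirrors the node's documented behavior: 300-second timeout, errors raised on
# non-200 responses and on JSON parsing failures.
resp = requests.post(f"{BASE_URL}/agents/web_scrape", json=payload, timeout=300)
resp.raise_for_status()
data = resp.json()
```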

Troubleshooting

  • Non-200 or request failure: Verify the URLs are valid and reachable, and ensure the Agents service endpoint is configured and accessible. Try reducing the number of URLs.
  • Timeouts: Reduce max_depth, provide fewer/lighter pages, or try again when the network is stable. Very large or complex pages increase processing time.
  • Empty answers: If no queries are provided, the answers output will be empty or minimal by design. Add queries to receive Q&A output.
  • Malformed documents JSON: The documents output is a JSON string. If downstream nodes fail to parse it, ensure they expect a JSON string and not a Python object (see the defensive parsing sketch after this list).
  • Unexpected HTML or proxy pages: If JSON decoding errors occur, the service may be returning an HTML error page (e.g., proxy/auth). Check service URL configuration and authentication requirements.
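If a downstream step needs to tolerate a documents value that is not valid JSON (for example, when an upstream error slips through), a defensive parse along these lines may help; the helper name is hypothetical:

```python
import json

def parse_documents(documents_output: str) -> list:
    """Hypothetical helper: parse the documents JSON string, returning [] on failure."""
    try:
        docs = json.loads(documents_output)
    except json.JSONDecodeError:
        return []  # not valid JSON, e.g. an HTML error page leaked through
    return docs if isinstance(docs, list) else []
```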

Example Pipelines
